February 19, 2026
How We Built Tinybird's TypeScript SDK for ClickHouse
Faster PlanetScale Postgres connections with Cloudflare Hyperdrive
February 18, 2026
OSTEP Chapter 9: Proportional Share Scheduling
The Crux: Fairness Over Speed. Unlike the schedulers we explored in Chapter 8 (such as Shortest Job First or the Multi-Level Feedback Queue), which optimize for turnaround time or response time, the proportional-share schedulers introduced in this chapter aim to guarantee that each job receives a specific percentage of CPU time.
(This is part of our series going through OSTEP book chapters.)
Basic Concept: Tickets
Lottery Scheduling serves as the foundational example of proportional-share schedulers. It uses a randomized mechanism to achieve fairness probabilistically. The central concept of Lottery Scheduling is the ticket. Tickets represent the share of the resource a process should receive.
The scheduler holds a lottery every time slice. If Job A has 75 tickets and Job B has 25 (100 total), the scheduler picks a random number between 0 and 99. Statistically, Job A will win 75% of the time. The implementation is incredibly simple: it requires a random number generator, a list of processes, and a loop that walks the list, adding each job's tickets to a running counter until that counter exceeds the winning number, as in the sketch below.
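A minimal sketch of that decision loop, written here in Python rather than the book's C, with illustrative ticket counts:

import random

def lottery_pick(jobs):
    """Pick the next job to run. `jobs` is a list of (name, tickets) pairs."""
    total = sum(tickets for _, tickets in jobs)
    winner = random.randrange(total)   # winning ticket in [0, total)
    counter = 0
    for name, tickets in jobs:
        counter += tickets             # walk the list, accumulating tickets
        if counter > winner:           # this job's ticket range covers the winner
            return name

# Job A holds 75 tickets, Job B holds 25: A should win about 75% of lotteries.
jobs = [("A", 75), ("B", 25)]
print(lottery_pick(jobs))

Over many time slices the win counts converge to the 75/25 split, which is exactly the probabilistic fairness described above.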
Advanced Ticket Mechanisms
1. Ticket Currency: Users can allocate tickets among their own jobs in local currency (e.g., 500 "Alice-tickets"), which the system converts to global currency (see the conversion sketch after this list). This delegates the "fairness" decision to the user.
2. Ticket Transfer: A client can temporarily hand its tickets to a server process to maximize performance while a specific request is being handled.
3. Ticket Inflation: In trusted environments, a process can unilaterally boost its ticket count to reflect a higher need for CPU. In competitive settings this is unsafe, since a greedy process could grant itself excessive tickets and monopolize the machine. In practice, modern systems prevent this with control groups (cgroups), which act as an external regulator that assigns fixed resource weights so untrusted processes cannot simply print more tickets to override the scheduler.
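To make the ticket-currency conversion in item 1 concrete: a user's local allocations are scaled by the user's global tickets divided by the total local tickets the user handed out. A small Python illustration with made-up numbers:

def to_global(user_global_tickets, local_allocations):
    """Convert a user's local ticket allocations into global tickets."""
    total_local = sum(local_allocations.values())
    return {job: user_global_tickets * local / total_local
            for job, local in local_allocations.items()}

# Alice holds 100 global tickets and splits 1000 "Alice-tickets" across two jobs:
print(to_global(100, {"A1": 500, "A2": 500}))   # {'A1': 50.0, 'A2': 50.0}
# Bob also holds 100 global tickets and gives all 10 "Bob-tickets" to one job:
print(to_global(100, {"B1": 10}))               # {'B1': 100.0}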
Lottery Scheduling depends on randomness to decide which job runs next. This randomness helps avoid the tricky cases that can trip up traditional algorithms, like LRU on cyclic workloads, and keeps the scheduler simple with minimal state to track. However, fairness is only achieved over time. In the short term, a job might get unlucky and lose more often than its share of tickets. Studies show that fairness is low for short jobs and only approaches perfect fairness as the total runtime increases.
Lottery vs. Stride vs. CFS scheduling
Stride Scheduling emerged to address the probabilistic quirks of Lottery Scheduling. It assigns each process a stride, inversely proportional to its tickets, and maintains a pass value tracking how much CPU time the process has received. At each decision point, the scheduler selects the process with the lowest pass value.
This guarantees exact fairness each cycle, but it introduces challenges with global state. When a new process arrives, assigning it a fair initial pass value is tricky: set it too low, and it can dominate the CPU; too high, and it risks starvation. In contrast, Lottery Scheduling handles new arrivals seamlessly, since it requires no global state.
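A minimal stride-scheduling sketch, again in Python with illustrative numbers (the constant 10,000 used to derive strides is an arbitrary choice):

import heapq

def stride_schedule(jobs, slices):
    """jobs: dict of name -> tickets. Run `slices` scheduling decisions."""
    BIG = 10_000
    # Each entry is (pass_value, name, stride); everyone starts with pass = 0.
    queue = [(0, name, BIG // tickets) for name, tickets in jobs.items()]
    heapq.heapify(queue)
    order = []
    for _ in range(slices):
        pass_value, name, stride = heapq.heappop(queue)  # lowest pass runs next
        order.append(name)
        heapq.heappush(queue, (pass_value + stride, name, stride))
    return order

# With tickets 100:50:250, jobs A:B:C run in exact 2:1:5 proportions over time.
print(stride_schedule({"A": 100, "B": 50, "C": 250}, 8))

Each job's pass value advances by its stride every time it runs, so jobs with more tickets (smaller strides) are selected proportionally more often, and the proportions are exact rather than probabilistic.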
The Linux Completely Fair Scheduler (CFS) builds on these earlier proportional schedulers but removes randomness by using Virtual Runtime (vruntime) to track each process’s CPU usage. At every scheduling decision, CFS selects the job with the smallest vruntime, ensuring a fair distribution of CPU time. To prevent excessive context-switching overhead when there are many tasks (each receiving only a tiny slice of CPU time), CFS enforces a min_granularity. This ensures every process runs for at least a minimum time slice, and it balances fairness with efficient CPU utilization.
To prioritize specific processes, CFS uses the classic UNIX "nice" level, which allows users to assign values between -20 (highest priority) and +19 (lowest priority). CFS maps these values to geometric weights; a process with a higher priority (lower nice value) is assigned a larger weight. This weight directly alters the rate at which vruntime accumulates: high-priority processes add to their vruntime much more slowly than low-priority ones. When determining exactly how long a process should run within the target scheduling latency (sched_latency), instead of simply dividing the target latency equally among all tasks (e.g., 48ms/n), CFS calculates the time slice for a specific process k as a fraction of the total weight of all currently running processes.
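Written out (following the book's presentation), the time slice for process k and the vruntime update for process i are:

$$\text{time\_slice}_k = \frac{\text{weight}_k}{\sum_{i=0}^{n-1} \text{weight}_i} \cdot \text{sched\_latency}$$

$$\text{vruntime}_i = \text{vruntime}_i + \frac{\text{weight}_0}{\text{weight}_i} \cdot \text{runtime}_i$$

Here $\text{weight}_0$ is the weight of the default nice value 0, so a nice-0 task accrues vruntime at wall-clock rate, while a higher-weight (lower-nice) task accrues it more slowly.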
Consequently, a high-priority job can run for a longer physical time while only "charging" a small amount of virtual time, allowing it to claim a larger proportional share of the CPU compared to "nicer" low-priority tasks.
Finally, because modern systems handle thousands of processes, CFS replaces the simple lists of Lottery Scheduling with Red-Black Trees, giving $O(\log n)$ efficiency for insertion and selection.
Challenges in Proportional Sharing
The I/O Problem. Proportional schedulers face a challenge when jobs sleep, such as waiting for I/O. In a straightforward model, a sleeping job lags behind, and when it resumes, it can monopolize the CPU to catch up, potentially starving other processes. CFS addresses this by resetting the waking job’s vruntime to the minimum value in the tree. This ensures no process starves, but it can penalize the interactive job, leading to slower response times.
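A sketch of that wakeup rule in Python (my own simplification; the kernel's actual placement logic has more cases):

def on_wakeup(task, min_vruntime):
    """Clamp a waking task's vruntime so it cannot bank the time it slept."""
    # The task runs soon, since its vruntime sits at the low end of the tree,
    # but it cannot monopolize the CPU to catch up on everything it missed.
    task["vruntime"] = max(task["vruntime"], min_vruntime)
    return task

# An interactive task slept while others advanced the tree's minimum to 5000.
print(on_wakeup({"name": "editor", "vruntime": 1200}, min_vruntime=5000))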
The Ticket Assignment Problem. Assigning tickets is still an open challenge. In general-purpose computing, such as browsers or editors, it's unclear how many tickets each application deserves, making fairness difficult to enforce. The situation is clearer in virtualization and cloud computing, where ticket allocation aligns naturally with resource usage: if a client pays for 25% of a server, it can be assigned 25% of the tickets, providing a clear and effective proportional share.
I’ve been experimenting with Gemini Pro and NotebookLM. What a time to be alive! The fancy slides above all came from these tools, and for the final summary infographic, it produced a solid visual mind map of everything. Decades ago, as a secondary school student, I created similar visual mind maps as mnemonic devices for exams. The Turkish education system relied heavily on memorization, but I needed to understand concepts first to be able to memorize them. So I connected ideas, contextualized them through these visual mind maps, and it worked wonders. I even sold these mind maps to friends before exams. Looking back, those experiences were formative for my later blogging and explanation efforts. Fun times.
Explaining why throughput varies for Postgres with a CPU-bound Insert Benchmark
Throughput for the write-heavy steps of the Insert Benchmark looks like a distorted sine wave with Postgres on CPU-bound workloads but not on IO-bound workloads. For the CPU-bound workloads the chart for max response time at N-second intervals is flat for inserts but looks like the distorted sine wave for deletes. To see the chart for deletes, scroll down from here. So this looks like a problem with deletes, and this post starts to explain it.
tl;dr
- Once again, blame vacuum
History of the Insert Benchmark
Long ago (prior to 2010) the Insert Benchmark was published by Tokutek to highlight things that the TokuDB storage engine was great at. I was working on MySQL at Google at the time and the benchmark was useful to me; however, it was written in C++. While the Insert Benchmark is great at showing the benefits of an LSM storage engine, this was years before MyRocks and I was only doing InnoDB at the time, on spinning disks. So I rewrote it in Python to make it easier to modify, then the Tokutek team improved a few things about my rewrite, and I have been enhancing it slowly since then.
Until a few years ago the steps of the benchmark were:
- load - insert in PK order
- create 3 secondary indexes
- do more inserts as fast as possible
- do rate-limited inserts concurrent with range and point queries
Now the steps are:
- load - insert in PK order
- create 3 secondary indexes
- do inserts+deletes at the same rate, as fast as possible
- do rate-limited inserts+deletes at the same rate concurrent with range and point queries
delete from %s where transactionid in
I assume the sine wave for delete response time is caused by one or both of:
- get_actual_variable_range() CPU overhead while planning the delete statement
- CPU overhead from scanning and skipping tombstones while executing the select subquery
MariaDB innovation: binlog_storage_engine, small server, Insert Benchmark
MariaDB 12.3 has a new feature enabled by the option binlog_storage_engine. When enabled it uses InnoDB instead of raw files to store the binlog. A big benefit from this is reducing the number of fsync calls per commit from 2 to 1 because it reduces the number of resource managers from 2 (binlog, InnoDB) to 1 (InnoDB).
My previous post had results for sysbench with a small server. This post has results for the Insert Benchmark with a similar small server. Both servers use an SSD that has high fsync latency. This is probably a best-case comparison for the feature. If you really care, then get enterprise SSDs with power loss protection. But you might encounter high fsync latency on public cloud servers.
tl;dr for a CPU-bound workload
- Enabling sync on commit for InnoDB and the binlog has a large impact on throughput for the write-heavy steps -- l.i0, l.i1 and l.i2.
- When sync on commit is enabled, then also enabling the binlog_storage_engine is great for performance as throughput on the write-heavy steps is 1.75X larger for l.i0 (load) and 4X or more larger on the random write steps (l.i1, l.i2)
tl;dr for an IO-bound workload
- Enabling sync on commit for InnoDB and the binlog has a large impact on throughput for the write-heavy steps -- l.i0, l.i1 and l.i2. It also has a large impact on qp1000, which is the most write-heavy of the query+write steps.
- When sync on commit is enabled, then also enabling the binlog_storage_engine is great for performance as throughput on the write-heavy steps is 4.74X larger for l.i0 (load), 1.50X larger for l.i1 (random writes) and 2.99X larger for l.i2 (random writes)
The server is an ASUS ExpertCenter PN53 with an AMD Ryzen 7 7735HS CPU, 8 cores, SMT disabled, and 32G of RAM. Storage is one NVMe device for the database using ext-4 with discard enabled. The OS is Ubuntu 24.04. More details on it are here. The storage device has high fsync latency.
- z12b
- my.cnf.cz12b_c8r32 is my default configuration. Sync-on-commit is disabled for both the binlog and InnoDB so that write-heavy benchmarks create more stress.
- z12c
- my.cnf.cz12c_c8r32 is like z12b except it enables binlog_storage_engine
- z12b_sync
- my.cnf.cz12b_sync_c8r32 is like z12b except it enables sync-on-commit for the binlog and InnoDB
- z12c_sync
- my.cnf.cz12c_sync_c8r32 is like z12c except it enables sync-on-commit for InnoDB. Note that InnoDB is used to store the binlog so there is nothing else to sync on commit.
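A rough sketch of how the four configs differ (not the actual my.cnf files; the exact value syntax for binlog_storage_engine is assumed, since the post only names the option):

# z12b (base): sync-on-commit disabled for both resource managers
sync_binlog                    = 0
innodb_flush_log_at_trx_commit = 0
# z12c: z12b plus the binlog stored in InnoDB (assumed value syntax)
binlog_storage_engine          = InnoDB
# z12b_sync: z12b plus sync-on-commit for the binlog and InnoDB
sync_binlog                    = 1
innodb_flush_log_at_trx_commit = 1
# z12c_sync: z12c plus sync-on-commit for InnoDB only, since the binlog
# now lives inside InnoDB and there is nothing else to sync
innodb_flush_log_at_trx_commit = 1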
- CPU-bound - the database is cached by InnoDB, but there is still much write IO
- IO-bound - most, but not all, benchmark steps are IO-bound
- l.i0
- insert XM rows per table in PK order. The table has a PK index but no secondary indexes. There is one connection per client. X is 30M for CPU-bound and 800M for IO-bound.
- l.x
- create 3 secondary indexes per table. There is one connection per client.
- l.i1
- use 2 connections/client. One inserts XM rows per table and the other does deletes at the same rate as the inserts. Each transaction modifies 50 rows (big transactions). This step is run for a fixed number of inserts, so the run time varies depending on the insert rate. X is 40M for CPU-bound and 4M for IO-bound.
- l.i2
- like l.i1 but each transaction modifies 5 rows (small transactions) and YM rows are inserted and deleted per table. Y is 10M for CPU-bound and 1M for IO-bound.
- Wait for S seconds after the step finishes to reduce MVCC GC debt and perf variance during the read-write benchmark steps that follow. The value of S is a function of the table size.
- qr100
- use 3 connections/client. One does range queries and performance is reported for this. The second does 100 inserts/s and the third does 100 deletes/s. The second and third are less busy than the first. The range queries use covering secondary indexes. If the target insert rate is not sustained then that is considered to be an SLA failure. If the target insert rate is sustained then the step does the same number of inserts for all systems tested. This step is frequently not IO-bound for the IO-bound workload. This step runs for 1800 seconds.
- qp100
- like qr100 except uses point queries on the PK index
- qr500
- like qr100 but the insert and delete rates are increased from 100/s to 500/s
- qp500
- like qp100 but the insert and delete rates are increased from 100/s to 500/s
- qr1000
- like qr100 but the insert and delete rates are increased from 100/s to 1000/s
- qp1000
- like qp100 but the insert and delete rates are increased from 100/s to 1000/s
- CPU-bound
- all-versions - results for z12b, z12c, z12b_sync and z12c_sync
- sync-only - results for z12b_sync vs z12c_sync
- IO-bound
- all-versions - results for z12b, z12c, z12b_sync and z12c_sync
- sync-only - results for z12b_sync vs z12c_sync
When relative QPS is > 1.0 the config is faster than the base config. When it is < 1.0 there is a regression. The Q in relative QPS measures:
- insert/s for l.i0, l.i1, l.i2
- indexed rows/s for l.x
- range queries/s for qr100, qr500, qr1000
- point queries/s for qp100, qp500, qp1000
- Enabling sync on commit for InnoDB and the binlog has a large impact on throughput for the write-heavy steps -- l.i0, l.i1 and l.i2.
- When sync on commit is enabled, then also enabling the binlog_storage_engine is great for performance as throughput on the write-heavy steps is 1.75X larger for l.i0 (load) and 4X or more larger on the random write steps (l.i1, l.i2)
- all-versions: (QPS for a given config / QPS for z12b)
- sync-only: (QPS for z12c_sync / QPS for z12b_sync)
| dbms | l.i0 | l.x | l.i1 | l.i2 | qr100 | qp100 | qr500 | qp500 | qr1000 | qp1000 |
|---|---|---|---|---|---|---|---|---|---|---|
| ma120300_rel_withdbg.cz12b_c8r32 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| ma120300_rel_withdbg.cz12c_c8r32 | 1.03 | 1.01 | 1.00 | 1.03 | 1.00 | 0.99 | 1.00 | 1.00 | 1.01 | 1.00 |
| ma120300_rel_withdbg.cz12b_sync_c8r32 | 0.04 | 1.02 | 0.07 | 0.01 | 1.01 | 1.01 | 1.00 | 1.01 | 1.00 | 1.00 |
| ma120300_rel_withdbg.cz12c_sync_c8r32 | 0.08 | 1.03 | 0.28 | 0.06 | 1.02 | 1.01 | 1.01 | 1.02 | 1.02 | 1.01 |
| dbms | l.i0 | l.x | l.i1 | l.i2 | qr100 | qp100 | qr500 | qp500 | qr1000 | qp1000 |
|---|---|---|---|---|---|---|---|---|---|---|
| ma120300_rel_withdbg.cz12b_sync_c8r32 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| ma120300_rel_withdbg.cz12c_sync_c8r32 | 1.75 | 1.01 | 3.99 | 6.83 | 1.01 | 1.01 | 1.01 | 1.01 | 1.03 | 1.01 |
- Enabling sync on commit for InnoDB and the binlog has a large impact on throughput for the write-heavy steps -- l.i0, l.i1 and l.i2. It also has a large impact on qp1000, which is the most write-heavy of the query+write steps.
- When sync on commit is enabled, then also enabling the binlog_storage_engine is great for performance as throughput on the write-heavy steps is 4.74X larger for l.i0 (load), 1.50X larger for l.i1 (random writes) and 2.99X larger for l.i2 (random writes)
- all-versions: (QPS for a given config / QPS for z12b)
- sync-only: (QPS for z12c_sync / QPS for z12b_sync)
| dbms | l.i0 | l.x | l.i1 | l.i2 | qr100 | qp100 | qr500 | qp500 | qr1000 | qp1000 |
|---|---|---|---|---|---|---|---|---|---|---|
| ma120300_rel_withdbg.cz12b_c8r32 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| ma120300_rel_withdbg.cz12c_c8r32 | 1.01 | 0.99 | 0.99 | 1.01 | 1.01 | 1.01 | 1.01 | 1.07 | 1.01 | 1.04 |
| ma120300_rel_withdbg.cz12b_sync_c8r32 | 0.04 | 1.00 | 0.55 | 0.10 | 1.02 | 0.97 | 1.00 | 0.80 | 0.95 | 0.55 |
| ma120300_rel_withdbg.cz12c_sync_c8r32 | 0.18 | 1.00 | 0.83 | 0.31 | 1.02 | 1.01 | 1.02 | 0.96 | 1.02 | 0.86 |
| dbms | l.i0 | l.x | l.i1 | l.i2 | qr100 | qp100 | qr500 | qp500 | qr1000 | qp1000 |
|---|---|---|---|---|---|---|---|---|---|---|
| ma120300_rel_withdbg.cz12b_sync_c8r32 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| ma120300_rel_withdbg.cz12c_sync_c8r32 | 4.74 | 1.00 | 1.50 | 2.99 | 1.00 | 1.04 | 1.02 | 1.20 | 1.08 | 1.57 |
February 17, 2026
Use default encryption at rest for new Amazon Aurora clusters
An Open Letter to Oracle: Let’s Talk About MySQL’s Future
February 16, 2026
MariaDB innovation: binlog_storage_engine
MariaDB 12.3 has a new feature enabled by the option binlog_storage_engine. When enabled it uses InnoDB instead of raw files to store the binlog. A big benefit from this is reducing the number of fsync calls per commit from 2 to 1 because it reduces the number of resource managers from 2 (binlog, InnoDB) to 1 (InnoDB).
In this post I have results for the performance benefit from this when using storage that has a high fsync latency. This is probably a best-case comparison for the feature. A future post will cover the benefit on servers that don't have high fsync latency.
tl;dr
- the performance benefit from this is excellent when storage has a high fsync latency
- my mental performance model needs to be improved. I guessed that throughput would increase by ~2X when using binlog_storage_engine relative to not using it but using sync_binlog=1 and innodb_flush_log_at_trx_commit=1. However, the improvement is larger than 4X.
The server is an ASUS ExpertCenter PN53 with an AMD Ryzen 7 7735HS CPU, 8 cores, SMT disabled, and 32G of RAM. Storage is one NVMe device for the database using ext-4 with discard enabled. The OS is Ubuntu 24.04. More details on it are here. The storage device has high fsync latency.
- z12b
- my.cnf.cz12b_c8r32 is my default configuration. Sync-on-commit is disabled for both the binlog and InnoDB so that write-heavy benchmarks create more stress.
- z12c
- my.cnf.cz12c_c8r32 is like z12b except it enables binlog_storage_engine
- z12b_sync
- my.cnf.cz12b_sync_c8r32 is like z12b except it enables sync-on-commit for the binlog and InnoDB
- z12c_sync
- my.cnf.cz12c_sync_c8r32 is like z12c except it enables sync-on-commit for InnoDB. Note that InnoDB is used to store the binlog so there is nothing else to sync on commit.
Benchmark
The read-heavy microbenchmarks run for 600 seconds and the write-heavy for 900 seconds.
The benchmark is run with 1 client, 1 table and 50M rows.
But here I only report results for the write-heavy tests.
I provide charts below with relative QPS. The relative QPS is the following:
(QPS for some version) / (QPS for base version)
I present results for:
- z12b, z12c, z12b_sync and z12c_sync with z12b as the base version
- z12b_sync and z12c_sync with z12b_sync as the base version
- z12b_sync has the worst performance thanks to 2 fsyncs per commit
- z12c_sync gets more than 4X the throughput vs z12b_sync. If fsync latency were the only thing that determined performance then I would expect the difference to be ~2X. There is more going on here and in the next section I mention that enabling binlog_storage_engine also reduces the CPU overhead.
- some per-test data from iostat and vmstat is here
- a representative sample of iostat collected at 1-second intervals during the update-inlist test is here. When comparing z12b_sync with z12c_sync
- the fsync rate (f/s) is ~2.5X larger for z12c_sync vs z12b_sync (~690/s vs ~275/s) but fsync latency (f_await) is similar. So with binlog_storage_engine enabled MySQL is more efficient, and perhaps thanks to a lower CPU overhead, there is less work to do in between calls to fsync
- z12c_sync gets more than 4X the throughput vs z12b_sync. If fsync latency were the only thing that determined performance then I would expect the difference to be ~2X. There is more going on here and below I mention that enabling binlog_storage_engine also reduces the CPU overhead.
- some per-test data from iostat and vmstat is here and the CPU overhead per operation is much smaller with binlog_storage_engine -- see here for the update-inlist test. In general, when sync-on-commit is enabled then the CPU overhead with binlog_storage_engine enabled is between 1/3 and 2/3 of the overhead without it enabled.
Relational composition and Codd's "connection trap" in PostgreSQL and MongoDB
Relational composition is to joins what the Cartesian product is to tables: it produces every result that could be true, not just what is true. This leads to common SQL mistakes, and it can often be suspected when a SELECT DISTINCT is added after a query starts returning more rows than expected, without the root cause being understood.
In its mathematical definition, relational composition is the derived relation obtained by existentially joining two relations on a shared attribute and projecting away that attribute. In a database, it is meaningful only when a real‑world invariant ensures that the resulting pairs reflect actual facts. Otherwise, the result illustrates what E. F. Codd, in his 1970 paper A Relational Model of Data for Large Shared Data Banks, called the connection trap.
Codd uses two relations in his example: a supplier supplies parts, and a project uses parts. At an intuitive level, this connection trap mirrors a syllogism: if a supplier supplies a part and a project uses that part, a join can derive that the supplier supplies the project—even when that conclusion was never stated as a fact.
Codd observed that the connection trap was common in pre‑relational network data models, where users navigated data by following physical pointers. Path existence was often mistaken for semantic relationship. The relational model solved this problem by replacing navigational access with declarative queries over explicitly defined relations, and modern document models now do the same.
However, while the relational model removes pointer‑based navigation, it does not eliminate the trap entirely. Joins can still compute relational compositions, and without appropriate cardinality constraints or business invariants, such compositions may represent only possible relationships rather than actual ones. In this way, the connection trap can be reintroduced at query time, even in modern relational systems such as PostgreSQL, and similarly through $lookup operations in MongoDB.
PostgreSQL — reproducing the connection trap
This model declares suppliers, parts, projects, and two independent many‑to‑many relationships:
CREATE TABLE suppliers (
supplier_id TEXT PRIMARY KEY
);
CREATE TABLE parts (
part_id TEXT PRIMARY KEY
);
CREATE TABLE projects (
project_id TEXT PRIMARY KEY
);
-- Supplier supplies parts
CREATE TABLE supplier_part (
supplier_id TEXT REFERENCES suppliers,
part_id TEXT REFERENCES parts,
PRIMARY KEY (supplier_id, part_id)
);
-- Project uses parts
CREATE TABLE project_part (
project_id TEXT REFERENCES projects,
part_id TEXT REFERENCES parts,
PRIMARY KEY (project_id, part_id)
);
This follows Codd’s classic suppliers–parts–projects example, where suppliers supply parts and projects use parts as independent relationships.
The following data asserts that project Alpha uses parts P1 and P2, that supplier S1 supplies parts P1 and P2, and that supplier S2 supplies parts P2 and P3:
INSERT INTO suppliers VALUES ('S1'), ('S2');
INSERT INTO parts VALUES ('P1'), ('P2'), ('P3');
INSERT INTO projects VALUES ('Alpha');
-- Supplier capabilities
INSERT INTO supplier_part VALUES
('S1', 'P1'),
('S1', 'P2'),
('S2', 'P2'),
('S2', 'P3');
-- Project uses parts P1 and P2
INSERT INTO project_part VALUES
('Alpha', 'P1'),
('Alpha', 'P2');
The following query is valid SQL:
SELECT DISTINCT
sp.supplier_id,
pp.project_id
FROM supplier_part sp
JOIN project_part pp
ON sp.part_id = pp.part_id;
However, this query falls into the connection trap:
supplier_id | project_id
-------------+------------
S2 | Alpha
S1 | Alpha
(2 rows)
As we defined only supplier–part and project–part relationships, any derived supplier–project relationship is not a fact but a relational composition. We know that Alpha uses P1 and P2, and that part P2 can be supplied by either S1 or S2, but we have no record of which supplier actually supplies Alpha.
This query asserts “Supplier S1 supplies project Alpha”, but the data only says: “S1 and S2 supply P2” and “Alpha uses P2”.
This is the connection trap, expressed purely in SQL.
PostgreSQL — the correct relational solution
If a supplier actually supplies a part to a project, that fact must be represented directly. We need a new table:
CREATE TABLE supply (
supplier_id TEXT,
project_id TEXT,
part_id TEXT,
PRIMARY KEY (supplier_id, project_id, part_id),
FOREIGN KEY (supplier_id, part_id)
REFERENCES supplier_part (supplier_id, part_id),
FOREIGN KEY (project_id, part_id)
REFERENCES project_part (project_id, part_id)
);
These foreign keys encode subset constraints between relations and prevent inserting supplies of parts not supplied by the supplier or not used by the project.
This relation explicitly states who supplies what to which project. We assume that the real‑world fact is “Alpha gets part P2 from supplier S1”:
INSERT INTO supply VALUES
('S1', 'Alpha', 'P2');
The correct query reads from this relation:
SELECT supplier_id, project_id
FROM supply;
supplier_id | project_id
-------------+------------
S1 | Alpha
(1 row)
The relationship is now real and asserted, not inferred. In total, we have six tables:
postgres=# \d
List of relations
Schema | Name | Type | Owner
--------+---------------+-------+----------
public | parts | table | postgres
public | project_part | table | postgres
public | projects | table | postgres
public | supplier_part | table | postgres
public | suppliers | table | postgres
public | supply | table | postgres
(6 rows)
In practice, you should either store the relationship explicitly or avoid claiming it exists. Although the relational model avoids pointers, it is still possible to join through an incorrect path, so the application must enforce the correct one.
In ad-hoc query environments such as data warehouses, data is typically organized into domains and modeled using a dimensional ("star schema") approach. Relationships like project–supplier are represented as fact tables within a single data mart, exposing only semantically valid join paths and preventing invalid joins.
MongoDB — reproducing the connection trap
The following MongoDB data mirrors the PostgreSQL example. MongoDB allows representing relationships either as separate collections or by embedding, depending on the bounded context. Here we start with separate collections to mirror the relational model:
db.suppliers.insertMany([
{ _id: "S1" },
{ _id: "S2" }
]);
db.parts.insertMany([
{ _id: "P1" },
{ _id: "P2" },
{ _id: "P3" }
]);
db.projects.insertMany([
{ _id: "Alpha" }
]);
// Supplier capabilities
db.supplier_parts.insertMany([
{ supplier: "S1", part: "P1" },
{ supplier: "S1", part: "P2" },
{ supplier: "S2", part: "P2" },
{ supplier: "S2", part: "P3" }
]);
// Project uses parts P1 and P2
db.project_parts.insertMany([
{ project: "Alpha", part: "P1" },
{ project: "Alpha", part: "P2" }
]);
Using the simple find() API, we cannot fall into the trap directly because there is no implicit connection between suppliers and projects. The application must issue two independent queries and combine the results explicitly.
Simulating the connection trap in a single query therefore requires explicit composition at the application level:
const partsUsedByAlpha = db.project_parts.find(
{ project: "Alpha" },
{ _id: 0, part: 1 }
).toArray();
const suppliersForParts = db.supplier_parts.find(
{ part: { $in: partsUsedByAlpha.map(p => p.part) } },
{ _id: 0, supplier: 1, part: 1 }
).toArray();
const supplierProjectPairs = suppliersForParts.map(sp => ({
supplier: sp.supplier,
project: "Alpha"
}));
print(supplierProjectPairs);
When forced by the application logic, here is the connection trap associating suppliers and projects:
[
{ supplier: 'S1', project: 'Alpha' },
{ supplier: 'S1', project: 'Alpha' },
{ supplier: 'S2', project: 'Alpha' }
]
As with SQL joins, a $lookup in an aggregation pipeline can fall into the same connection trap:
db.supplier_parts.aggregate([
{
$lookup: {
from: "project_parts",
localField: "part",
foreignField: "part",
as: "projects"
}
},
{ $unwind: "$projects" },
{
$project: {
_id: 0,
supplier: "$supplier",
project: "$projects.project"
}
}
]);
The result is similar; the projection removes the intermediate attributes:
{ "supplier": "S1", "project": "Alpha" }
{ "supplier": "S1", "project": "Alpha" }
{ "supplier": "S2", "project": "Alpha" }
We reproduced the connection trap by ignoring that $lookup produces a derived relationship, not a real one, and that matching keys does not carry business meaning.
MongoDB — normalized solution
As with SQL, we can add an explicit supplies collection that stores the relationship between projects and suppliers:
db.supplies.insertOne({
project: "Alpha",
supplier: "S1",
part: "P2"
});
Then we simply query this collection:
db.supplies.find(
{ project: "Alpha" },
{ _id: 0, supplier: 1, part: 1 }
);
[ { supplier: 'S1', part: 'P2' } ]
The document model is a superset of the relational model as relations can be stored as flat collections. The difference is that referential integrity is enforced by the application rather than in-database. To enforce relationships in the database, they must be embedded as sub-documents and arrays.
MongoDB — domain-driven solution
It's not the only solution in a document database: we can design the schema around the domain model rather than a normalized one. MongoDB allows representing this relationship as part of an aggregate. In a project‑centric bounded context, the project is the aggregate root, and the supplier information can be embedded as part of the supply fact:
db.projects.updateOne(
{ _id: "Alpha" },
{
$set: {
parts: [
{ part: "P2", supplier: "S1" },
{ part: "P1", supplier: null }
]
}
},
{ upsert: true }
);
The query doesn't need a join and cannot fall into the connection trap:
db.projects.find(
{ _id: "Alpha" },
{ _id: 1, "parts.supplier": 1 }
);
[
{
_id: 'Alpha',
parts: [
{ supplier: 'S1' },
{ supplier: null }
]
}
]
This avoids the connection trap by construction. It may look like data duplication—the same supplier name may appear in multiple project documents—and indeed this would be undesirable in a fully normalized model shared across all business domains. However, this structure represents a valid aggregate within a bounded context.
In this context, the embedded supplier information is part of the supply fact, not a reference to a global supplier record. If a supplier’s name changes, it is a business decision, not a database decision, whether that change should be propagated to existing projects or whether historical data should retain the supplier name as it was at the time of supply.
Even when propagation is desired, MongoDB allows updating embedded data efficiently:
db.projects.updateMany(
// document filter
{ "parts.supplier": "S1" },
// document update using the array's item from array filter
{
$set: {
"parts.$[p].supplier": "Supplier One"
}
},
// array filter defining the array's item for the update
{
arrayFilters: [{ "p.supplier": "S1" }]
}
);
This update is not atomic across documents, but each document update is atomic, the operation is idempotent, and it can be safely retried or executed within an explicit transaction if full atomicity is required.
Conclusion
The connection trap occurs whenever relationships are inferred from shared keys, at query time, instead of being explicitly represented as facts, at write time. In SQL, this means introducing explicit association tables and enforcing integrity constraints, rather than deriving them through joins. In MongoDB, it means modeling relationships as explicit documents or embedded subdocuments rather than deriving them through lookups.
In a relational database, the schema is designed to be normalized and independent of specific use cases. All many‑to‑many and fact‑bearing relationships must be declared explicitly, and queries must follow the correct relational path. Referential and cardinality constraints are essential to restrict to only actual facts.
In MongoDB, the data model is typically driven by the domain and the application’s use cases. In a domain-driven design (DDD), strong relationships are modeled as aggregates, embedding related data directly within a document in MongoDB collections. This makes the intended semantics explicit and avoids inferring relationships at query time. Apparent duplication is not a flaw here, but a deliberate modeling choice within a bounded context.
Ultimately, the connection trap is not fully avoided by the data model; it can still appear as a query-time error with joins and projections: deriving relationships that were never asserted. Whether using normalized relations or domain‑driven documents, the rule is the same—if a relationship is a fact, it must be stored as one.
February 15, 2026
HammerDB tproc-c on a large server, Postgres and MySQL
This has results for HammerDB tproc-c on a large server using MySQL and Postgres. I am new to HammerDB and still figuring out how to explain and present results so I will keep this simple and just share graphs without explaining the results.
The comparison might favor Postgres for the IO-bound workloads because I used smaller buffer pools than normal to avoid OOM. I have to do this because RSS for the HammerDB client grows over time as it buffers more response time stats. And while I used buffered IO for Postgres, I used O_DIRECT for InnoDB. So Postgres might have avoided some read IO thanks to the OS page cache while InnoDB did not.
tl;dr for MySQL
- With vu=40 MySQL 8.4.8 uses about 2X more CPU per transaction and does more than 2X more context switches per transaction compared to Postgres 18.1. I will get CPU profiles soon.
- Modern MySQL brings us great improvements to concurrency and too many new CPU overheads
- MySQL 5.6 and 8.4 have similar throughput at the lowest concurrency (vu=10)
- MySQL 8.4 is a lot faster than 5.6 at the highest concurrency (vu=40)
tl;dr for Postgres
- Modern Postgres has regressions relative to old Postgres
- The regressions increase with the warehouse count; at wh=4000 the NOPM drop is between 3% and 13% depending on the virtual user count (vu).
tl;dr for Postgres vs MySQL
- Postgres and MySQL have similar throughput for the largest warehouse count (wh=4000)
- Otherwise Postgres gets between 1.4X and 2X more throughput (NOPM)
- an ax162s with an AMD EPYC 9454P 48-Core Processor with SMT disabled
- 2 Intel D7-P5520 NVMe storage devices with RAID 1 (3.8T each) using ext4
- 128G RAM
- Ubuntu 22.04 running the non-HWE kernel (5.15.0-118-generic)
- prior to 9.6 the config file is named my.cnf.cz12a50g_c32r128 (z12a50g_c32r128 or z12a50g) and is here for versions 5.6, 5.7, 8.0 and 8.4
- for 9.6 it is named my.cnf.cz13a50g_c32r128 (z13a50g_c32r128 or z13a50g) and is here
The benchmark was run for several workloads:
- vu=10, wh=1000 - 10 virtual users, 1000 warehouses
- vu=20, wh=1000 - 20 virtual users, 1000 warehouses
- vu=40, wh=1000 - 40 virtual users, 1000 warehouses
- vu=10, wh=2000 - 10 virtual users, 2000 warehouses
- vu=20, wh=2000 - 20 virtual users, 2000 warehouses
- vu=40, wh=2000 - 40 virtual users, 2000 warehouses
- vu=10, wh=4000 - 10 virtual users, 4000 warehouses
- vu=20, wh=4000 - 20 virtual users, 4000 warehouses
- vu=40, wh=4000 - 40 virtual users, 4000 warehouses
- stored procedures are enabled
- partitioning is used because the warehouse count is >= 1000
- a 5 minute rampup is used
- then performance is measured for 60 minutes
- average wMB/s increases with the warehouse count for Postgres but not for MySQL
- r/s increases with the warehouse count for Postgres and MySQL
- CPU utilization is almost 2X larger for MySQL
- Context switch rates are more than 2X larger for MySQL
- In the future I hope to learn why MySQL uses almost 2X more CPU per transaction and has more than 2X more context switches per transaction relative to Postgres
The relative throughput is: (NOPM for some-version / NOPM for base-version)
I provide three charts below:
- only MySQL - base-version is MySQL 5.6.51
- only Postgres - base-version is Postgres 12.22
- Postgres vs MySQL - base-version is Postgres 18.1, some-version is MySQL 8.4.8
Legend:
- my5651.z12a is MySQL 5.6.51 with the z12a50g config
- my5744.z12a is MySQL 5.7.44 with the z12a50g config
- my8045.z12a is MySQL 8.0.45 with the z12a50g config
- my8408.z12a is MySQL 8.4.8 with the z12a50g config
- my9500.z13a is MySQL 9.6.0 with the z13a50g config
Summary
- At the lowest concurrency (vu=10) MySQL 8.4.8 has similar throughput as 5.6.51 because CPU regressions in modern MySQL offset the concurrency improvements.
- At the highest concurrency (vu=40) MySQL 8.4.8 is much faster than 5.6.51 and the regressions after 5.7 are small. This matches what I have seen elsewhere -- while modern MySQL suffers from CPU regressions it benefits from concurrency improvements. Imagine if we could get those concurrency improvements without the CPU regressions.
And the absolute NOPM values are here:
Results: Postgres 12 to 18
Legend:
- pg1222 is Postgres 12.22 with the x10a50g config
- pg1323 is Postgres 13.23 with the x10a50g config
- pg1420 is Postgres 14.20 with the x10a50g config
- pg1515 is Postgres 15.15 with the x10a50g config
- pg1611 is Postgres 16.11 with the x10a50g config
- pg177 is Postgres 17.7 with the x10a50g config
- pg181 is Postgres 18.1 with the x10b50g config
Summary
- Modern Postgres has regressions relative to old Postgres
- The regressions increase with the warehouse count; at wh=4000 the NOPM drop is between 3% and 13% depending on the virtual user count (vu).
by Mark Callaghan (noreply@blogger.com)
Butlers or Architects?
In a recent viral post, Matt Shumer declares dramatically that we've crossed an irreversible threshold. He asserts that the latest AI models now exercise independent judgment, that he simply gives an AI plain-English instructions, steps away for a few hours, and returns to a flawlessly finished product that surpasses his own capabilities. In the near future, he claims, AI will autonomously handle all knowledge work and even build the next generation of AI itself, leaving human creators completely blindsided by the exponential curve.
This was a depressing read. The dramatic tone lands well. And by extrapolating from progress in the last six years, it's hard to argue against what AI might achieve in the next six. I forwarded this to a friend of mine, who had the misfortune of reading it before bed. He told me he had a nightmare about it, dreaming of himself as an Uber driver, completely displaced from his high-tech career.
Someone on Twitter had a comeback: "The thing I don't get is: Claude Code is writing 100% of Claude's code now. But Anthropic has 100+ open dev positions on their jobs page?" Boris Cherny of Anthropic replied: "The reality is that someone has to prompt the Claudes, talk to customers, coordinate with other teams, and decide what to build next. Engineering is changing, and great engineers are more important than ever."
This is strongly reminiscent of the Shell Game podcast I wrote about recently. And it connects to my arguments in "Agentic AI and The Mythical Agent-Month" about the mathematical laws of scaling coordination. Throwing thousands of AI agents at a project does not magically bypass Brooks' Law. Agents can dramatically scale the volume of code generated, but they do not scale insight. Coordination complexity and verification bottlenecks remain firmly in place. Until you solve the epistemic gap of distributed knowledge, adding more agents simply produces a faster, more expensive way to generate merge conflicts. Design, at its core, is still very human.
Trung Phan's recent piece on how Docusign still employs 7,000 people in the age of AI provides useful context as well. Complex organizations don't dissolve overnight. Societal constructs, institutional inertia, regulatory frameworks, and the deeply human texture of business relationships all act as buffers. The world changes slower than the benchmarks suggest. So we are nowhere near a fully autonomous AI that sweeps up all knowledge work and solves everything.
When we step back, two ways of reading the situation come into view. The first is that we are all becoming butlers for LLMs: priming the model, feeding it context in careful portions, adding constraints, nudging tone, coaxing the trajectory. Then stepping back to watch it cook. We do the setup and it does the real work. But as a perennial optimist, I think we are becoming architects. Deep work will not disappear; rather, it will become the only work that matters. We get to design the blueprint, break down logic in high-level parts, set the vision, dictate strategy, and chart trajectory. We do the real thinking, and then we make the model grind.
Anyway, this shift brings a real danger. If we delegate execution, it becomes tempting to gradually delegate thought. LLMs make thinking feel optional. People were already reluctant to think; now they can bypass it entirely. It is unsettling to watch a statistical prediction machine stand in for reasoning. Humbling, too. Maybe we're not as special as we assumed.
This reminds me of Ted Chiang's story "Catching Crumbs from the Table", where humanity is reduced to interpreting the outputs of a vastly superior intellect. Human scientists no longer produce breakthroughs themselves; they spend their careers reverse-engineering discoveries made by "metahumans". The tragedy is that humans are no longer the source of the insight; they are merely trying to explain the metahumans' genius. The title captures the feeling really well. We're not at the table anymore. We're just gathering what falls from it.
Even if things come to that, I know I'll keep thinking, keep learning, keep striving to build things. As I reflected in an earlier post on finding one's true calling, this pursuit of knowledge and creation is my dharma. That basic human drive to understand things and build things is not something an LLM can automate away. This I believe.
I recently launched a free email newsletter for the blog. Subscribe here to get these essays delivered to your inbox, along with behind-the-scenes commentary and curated links on distributed systems, technology, and other curiosities.
Generating vector embeddings for semantic search locally
This is an external post of mine. Click here if you are not redirected.
February 14, 2026
HammerDB tproc-c on a small server, Postgres and MySQL
This has results for HammerDB tproc-c on a small server using MySQL and Postgres. I am new to HammerDB and still figuring out how to explain and present results so I will keep this simple and just share graphs without explaining the results.
tl;dr
Builds, configuration and hardware
I compiled Postgres versions from source: 12.22, 13.23, 14.20, 15.15, 16.11, 17.7 and 18.1. I compiled MySQL versions from source: 5.6.51, 5.7.44, 8.0.44, 8.4.7, 9.4.0 and 9.5.0.
The server is an ASUS ExpertCenter PN53 with an AMD Ryzen 7 7735HS CPU, 8 cores, SMT disabled, and 32G of RAM. Storage is one NVMe device for the database using ext-4 with discard enabled. The OS is Ubuntu 24.04. More details on it are here.
For MySQL the config files are named my.cnf.cz12a_c8r32 and are here: 5.6.51, 5.7.44, 8.0.4x, 8.4.x, 9.x.0. For both Postgres and MySQL fsync on commit is disabled to avoid turning this into an fsync benchmark. The server has an SSD with high fsync latency.
Benchmark
The benchmark is tproc-c from HammerDB. The tproc-c benchmark is derived from TPC-C. The benchmark was run for several workloads:
The w=100 workloads are less heavy on IO. The w=1000 and w=2000 workloads are more heavy on IO. The benchmark for Postgres is run by this script which depends on scripts here. The MySQL scripts are similar.
Results
My analysis at this point is simple -- I only consider average throughput. Eventually I will examine throughput over time and efficiency (CPU and IO). On the charts that follow the y-axis does not start at 0 to improve readability at the risk of overstating the differences. The y-axis shows relative throughput. There might be a regression when the relative throughput is less than 1.0. There might be an improvement when it is > 1.0. The relative throughput is:
I provide three charts below:
Results: MySQL 5.6 to 8.4
Legend:
Summary
Results: Postgres 12 to 18
Legend:
Summary
Results: MySQL vs Postgres
Legend:
Summary
Cross join in MongoDB
Relational database joins are, conceptually, a cartesian product followed by a filter (the join condition). Without that condition, you get a cross join that returns every possible combination. In MongoDB, you can model the same behavior at read time using $lookup and $unwind.
Example
Define two collections: one for clothing sizes and one for gender-specific fits:
Each collection stores independent characteristics, and every size applies to every fit. The goal is to generate all valid product variants.
Cross join on read: $lookup + $unwind
In order to add all sizes to each body shape, use a $lookup without a filter condition and, as it adds them as an embedded array, use $unwind to get one document per combination.
Application-side
For such small static reference collections, the application may simply read both and join with loops.
While it's good to keep the reference data in a database, such static data can stay cached in the application.
Cross join on write: embed the many-to-many
Because sizes are inherently tied to body shapes (no size exists without a body shape), embedding them in the fits (body shape) collection is a natural choice.
Here is the new shape of the single collection. Once embedded, the query becomes straightforward: simply unwind the embedded array.
You may embed only the fields required, like the size code, or all fields like I did here with the neck size, and then remove the size collection.
Although this may duplicate the values for each body shape, it only requires using
Duplication has the advantage of returning all required information in a single read, without joins or multiple queries, and it is not problematic for updates since it can be handled with a single bulk update operation. Unlike relational databases—where data can be modified through ad‑hoc SQL and business rules must therefore be enforced at the database level—MongoDB applications are typically domain‑driven, with clear ownership of data and a single responsibility for performing updates. In that context, consistency is maintained by the application's service rather than by cross‑table constraints. This approach also lets business rules evolve, such as defining different sizes for men and women, without changing the data model.
Conclusion
In a fully normalized relational model, all relationships use the same pattern: a one-to-many relationship between two tables, enforced by a primary (or unique) key on one side and a foreign key on the other. This holds regardless of cardinality (many can be three or one million), lifecycle rules (cascade deletes or updates), ownership (shared or exclusive parent), and navigation direction (access patterns). Even many-to-many relationships are just two one-to-many relationships via a junction table. MongoDB exposes these same concepts as modeling choices—handled at read time with $lookup or at write time with embedding.
February 13, 2026
Supabase incident on February 12, 2026
A detailed account of the February 12 outage in us-east-2, what caused it, and the steps we are taking to prevent it from happening again.
How We Optimized Top K in Postgres
How ParadeDB uses principles from search engines to optimize Postgres' Top K performance.