April 07, 2026
The origins of MongoDB
The Web Archive holds some real gems. Let’s trace the origins of MongoDB with links to its archived 2008 content. The earliest snapshot is of 10gen.com, the company that created MongoDB as the internal data layer of a larger platform before it became a standalone product.
MongoDB was first described by its founders as an object-oriented DBMS, offering an interface similar to an ORM but as the native database interface rather than a translation layer, making it faster, more powerful, and easier to set up. The terminology later shifted to document-oriented database, which better reflects a key architectural point: object databases store objects together with their behavior (methods, class definitions, executable code), while document databases store only the data — the structure and values describing an entity. In MongoDB, this data is represented in JSON (because it is easier to read than XML), or more precisely BSON (Binary JSON), which extends JSON with types such as dates, binary data, and more precise numeric values.
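To make the type gap concrete, here is a minimal Python sketch (the field names are invented): plain JSON cannot encode dates or binary data at all, which is exactly the gap BSON fills.

```python
import json
from datetime import datetime, timezone

# A document as an application would model it. JSON alone cannot
# represent these value types; BSON extends JSON to cover them.
doc = {
    "name": "event",
    "created_at": datetime(2008, 1, 1, tzinfo=timezone.utc),  # BSON: Date
    "payload": b"\x00\x01\x02",                               # BSON: Binary
}

# Plain JSON rejects datetime and bytes values outright.
try:
    json.dumps(doc)
except TypeError as e:
    print("JSON cannot encode this document:", e)
```

A BSON library (such as the one bundled with the MongoDB drivers) would serialize this document without losing the date or binary types.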
- Like object-oriented databases, MongoDB stores an entity's data — or, in DDD terms, an aggregate of related entities and values — as a single, hierarchical structure with nested objects, arrays, and relationships, instead of decomposing it into rows across multiple normalized tables, as relational databases do.
- Like relational databases, MongoDB keeps data and code separate, a core principle of database theory. The database stores only data. Behavior and logic live in the application, where they can be version-controlled, tested, and deployed independently.
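The contrast between these two points can be sketched in a few lines of Python (all names and values are invented): the same order modeled as one hierarchical document versus decomposed into normalized rows.

```python
# One order, modeled as a single hierarchical document (a DDD aggregate):
order_doc = {
    "_id": 1001,
    "customer": {"name": "Ada", "email": "ada@example.com"},
    "lines": [
        {"sku": "A-1", "qty": 2, "price": 9.99},
        {"sku": "B-7", "qty": 1, "price": 24.50},
    ],
}

# The same data decomposed into normalized relational rows:
orders = [(1001,)]
customers = [(1001, "Ada", "ada@example.com")]
order_lines = [(1001, "A-1", 2, 9.99), (1001, "B-7", 1, 24.50)]

# Reading the whole aggregate is one document fetch in the document
# model, versus a three-table join in the relational layout.
total = sum(line["qty"] * line["price"] for line in order_doc["lines"])
print(round(total, 2))  # 44.48
```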
MongoDB's goal was to combine the speed and scalability of key-value stores with the rich functionality of relational databases, while simplifying coding significantly: BSON maps naturally onto the objects of modern object-oriented languages without a complicated ORM layer.
An early 10gen white paper, A Brief Introduction to MongoDB, framed MongoDB's creation within a broader database evolution — from three decades of relational dominance, through the rise of OLAP for analytics, to the need for a similar shift in operational workloads. The paper identified three converging forces: big data with high operation rates, agile development demanding continuous deployment and short release cycles, and cloud computing on commodity hardware. Today, releasing every week or even every day is common, whereas in the relational world, a schema migration every month is often treated as an anomaly in the development process.
The same paper explains that horizontal scalability is central to the architecture, using sharding and replica sets to be cloud-native — unlike relational databases, where replication was added later by reusing crash and media recovery techniques to send write-ahead logs over the network.
Before MongoDB, founders Dwight Merriman and Eliot Horowitz had already built large-scale systems. Dwight co-founded DoubleClick, an internet advertising platform that handled hundreds of thousands of ad requests per second and was later acquired by Google, where it still underpins much of online advertising. Eliot, at ShopWiki, shared Dwight's frustration with the state of databases. Whether they used Oracle, MySQL, or Berkeley DB, nothing fit their needs, forcing them to rely on workarounds like ORMs, caches that could serve stale data, and application-level sharding.
Dwight Merriman explained this frustration in Databases and the Cloud.
In 2007, architects widely accepted duct-tape solutions and workarounds for SQL databases:
Caching layers in front of databases, with no immediate consistency. Degraded consistency guarantees were treated as normal because SQL databases were saturated by calls from the new object-oriented applications.
Hand-coded, fragile, application-specific sharding. Each team reinvented distributed data management from scratch, inheriting bugs, edge cases, and heavy maintenance.
Stored procedures to reduce multi-statement transactions to a single call to the database. Writes went through stored procedures while reads hit the database directly, pushing critical business logic into the database, outside version control, and forcing developers to work in three languages: the application language, SQL, and the stored procedure language.
Query construction via string concatenation, effectively embedding custom code generators in applications to build SQL dynamically. Although the SQL standard defined embedded SQL, precompilers were available only for non–object-oriented languages.
Vertical scaling: when you needed more capacity, you bought a bigger server. Teams had to plan scale and costs upfront, ran into a hard ceiling where only parallelism could help, and paid a premium for large enterprise machines compared with commodity hardware. Meanwhile, startups were moving to EC2 and cloud computing. A database that scaled only vertically was fundamentally at odds with the cloud-native future they saw coming.
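The string-concatenation workaround described above is easy to demonstrate. Here is a minimal sketch using Python's built-in sqlite3 module (table and data are invented), contrasting concatenation with the placeholder style that drivers later made standard:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'alice')")

name = "alice"

# The 2007-era workaround: building SQL by string concatenation.
# Fragile, and famously injection-prone.
concatenated = "SELECT id FROM users WHERE name = '" + name + "'"
rows = conn.execute(concatenated).fetchall()

# The safer form: bound parameters, handled by the driver.
rows2 = conn.execute("SELECT id FROM users WHERE name = ?", (name,)).fetchall()

print(rows == rows2)  # True
```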
Beyond infrastructure workarounds, there was a deeper disconnect with how software was being built. By 2008, agile development dominated. Teams iterated quickly — at Facebook, releases went out daily, and broken changes were simply rolled back. Relational databases, however, remained in a waterfall world. Schema migrations meant downtime, and rollbacks were risky. The database had become the primary obstacle to the agile experience teams wanted.
Scaling horizontally was the other key challenge. Many NoSQL databases solved it by sharply reducing functionality—sometimes to little more than primary-key get/put—making distribution trivial. MongoDB instead asked: what is the minimum we must drop to scale out? It kept much more of the relational model: ad hoc queries, secondary indexes, aggregation, and sorting. It dropped only what it couldn’t yet support at large distributed scale: joins across thousands of servers and full multi-document transactions. Transactions weren’t removed but were limited to a single document, which could be rich enough to represent a business transaction that might otherwise span hundreds of rows across several relational tables. Both were later added: distributed joins via the $lookup aggregation stage, and full multi-document ACID transactions.
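For illustration, here is what such a server-side join looks like today as a $lookup aggregation stage, written as the Python dict a driver such as pymongo would accept (the collection and field names are hypothetical):

```python
# A server-side left outer join via the $lookup aggregation stage.
# Collection and field names here are illustrative only.
pipeline = [
    {"$match": {"status": "active"}},
    {"$lookup": {
        "from": "orders",               # the other collection to join
        "localField": "_id",            # field in the input documents
        "foreignField": "customer_id",  # field in the "from" collection
        "as": "orders",                 # joined documents land in this array
    }},
]

# With a live connection this would run as, e.g.:
#   db.customers.aggregate(pipeline)
print(pipeline[1]["$lookup"]["from"])  # orders
```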
Many people think MongoDB has no schema, but "schemaless" is misleading. MongoDB uses a dynamic, or implicit, schema. When you start a new MongoDB project, you still design a schema—you just don’t define it upfront in the database dictionary. And it has schema validation, relationships and consistency, all within the document boundaries owned by the application service.
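As an illustration of that validation, here is a $jsonSchema validator expressed as a Python dict (the field names are invented); it could be attached when the collection is created:

```python
# A MongoDB $jsonSchema validator: despite the "schemaless" label,
# the database can enforce structure. Field names are illustrative.
validator = {
    "$jsonSchema": {
        "bsonType": "object",
        "required": ["email", "created_at"],
        "properties": {
            "email": {"bsonType": "string"},
            "created_at": {"bsonType": "date"},
        },
    }
}

# Applied at collection creation time, e.g. with pymongo:
#   db.create_collection("users", validator=validator)
print(validator["$jsonSchema"]["required"])
```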
It's interesting to look at the history and see what remains true or has changed. SQL databases have evolved and allow more agility, with some online DDL and JSON datatypes. As LLMs become fluent at generating and understanding code, working with multiple languages may matter less. The deeper problem is when business logic sits outside main version control and test pipelines, and is spread across different execution environments.
Cloud-native infrastructure is even more important today, as the application infrastructure must not only be cost-efficient on commodity hardware but also resilient to the new failure modes that arise in those environments. Agile development methods are arguably even more relevant with AI-generated applications. Rather than building one central database with all referential integrity enforced synchronously, teams increasingly need small, independent bounded contexts that define their own consistency and transaction boundaries — decoupled from other microservices to reduce the blast radius of failures and changes.
Finally, the video from the What Is MongoDB page from 2011 summarizes all of this:
Like all databases, MongoDB has evolved significantly over the past two decades. However, it’s worth remembering that it began with a strong focus on developer experience, on ensuring data consistency at the application layer, not only within the database, and on being optimized for cloud environments.
MongoDB Query Planner
SQL databases use query planners (often cost-based optimizers) so developers don’t worry about physical data access. Many NoSQL systems like DynamoDB and Redis drop this layer, making developers act as the query planner by querying indexes directly. MongoDB keeps a query planner—an empirical, trial-based multi-planner—that chooses the best index and reuses the winning plan until it’s no longer optimal. Here is how it works:
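As a rough sketch of the idea (not MongoDB's actual implementation), the multi-planner can be thought of as racing candidate plans for a short trial and caching the winner; the plan names and cost functions below are invented:

```python
# Toy sketch of an empirical, trial-based multi-planner: race the
# candidate plans for a bounded trial, keep the one that did the
# least work, and (in the real system) cache it for reuse.

def pick_plan(candidate_plans, trial_docs=101):
    # Run each candidate for a bounded trial and record its cost.
    trials = {name: cost(trial_docs) for name, cost in candidate_plans.items()}
    winner = min(trials, key=trials.get)
    return winner, trials

plans = {
    "ix_scan_on_a": lambda n: 1.0 * n,     # selective index: cheap
    "ix_scan_on_b": lambda n: 4.0 * n,     # poor index: more keys examined
    "collection_scan": lambda n: 20.0 * n, # full scan: most work
}

winner, trials = pick_plan(plans)
print(winner)  # ix_scan_on_a
```

The real planner also re-evaluates the cached plan: when the winning plan stops performing well, the trial race is run again.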
The Future of Everything is Lies, I Guess
This is a long article, so I'm breaking it up into a series of posts which will be released over the next few days. You can also read the full work as a PDF or EPUB; these files will be updated as each section is released.
This is a weird time to be alive.
I grew up on Asimov and Clarke, watching Star Trek and dreaming of intelligent machines. My dad’s library was full of books on computers. I spent camping trips reading about perceptrons and symbolic reasoning. I never imagined that the Turing test would fall within my lifetime. Nor did I imagine that I would feel so disheartened by it.
Around 2019 I attended a talk by one of the hyperscalers about their new cloud hardware for training Large Language Models (LLMs). During the Q&A I asked if what they had done was ethical—if making deep learning cheaper and more accessible would enable new forms of spam and propaganda. Since then, friends have been asking me what I make of all this “AI stuff”. I’ve been turning over the outline for this piece for years, but never sat down to complete it; I wanted to be well-read, precise, and thoroughly sourced. A half-decade later I’ve realized that the perfect essay will never happen, and I might as well get something out there.
This is bullshit about bullshit machines, and I mean it. It is neither balanced nor complete: others have covered ecological and intellectual property issues better than I could, and there is no shortage of boosterism online. Instead, I am trying to fill in the negative spaces in the discourse. “AI” is also a fractal territory; there are many places where I flatten complex stories in service of pithy polemic. I am not trying to make nuanced, accurate predictions, but to trace the potential risks and benefits at play.
Some of these ideas felt prescient in the 2010s and are now obvious. Others may be more novel, or not yet widely-heard. Some predictions will pan out, but others are wild speculation. I hope that regardless of your background or feelings on the current generation of ML systems, you find something interesting to think about.
What is “AI”, Really?
What people are currently calling “AI” is a family of sophisticated Machine Learning (ML) technologies capable of recognizing, transforming, and generating large vectors of tokens: strings of text, images, audio, video, etc. A model is a giant pile of linear algebra which acts on these vectors. Large Language Models, or LLMs, operate on natural language: they work by predicting statistically likely completions of an input string, much like a phone autocomplete. Other models are devoted to processing audio, video, or still images, or link multiple kinds of models together.1
Models are trained once, at great expense, by feeding them a large corpus of web pages, pirated books, songs, and so on. Once trained, a model can be run again and again cheaply. This is called inference.
Models do not (broadly speaking) learn over time. They can be tuned by their operators, or periodically rebuilt with new inputs or feedback from users and experts. Models also do not remember things intrinsically: when a chatbot references something you said an hour ago, it is because the entire chat history is fed to the model at every turn. Longer-term “memory” is achieved by asking the chatbot to summarize a conversation, and dumping that shorter summary into the input of every run.
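A toy Python sketch of that mechanism (the model call here is a stand-in, not any real inference API): the "model" is stateless, and memory is just the client re-sending the transcript on every turn.

```python
# The model is stateless: it sees only the prompt it is handed.
def call_model(prompt: str) -> str:
    # Pretend model: just acknowledges the last line of the prompt.
    last = prompt.strip().splitlines()[-1]
    return f"(reply to: {last})"

history = []

def chat(user_msg: str) -> str:
    history.append(f"User: {user_msg}")
    # The ENTIRE history, not just the new message, goes to the model.
    reply = call_model("\n".join(history))
    history.append(f"Assistant: {reply}")
    return reply

chat("My name is Sam.")
print(chat("What is my name?"))  # (reply to: User: What is my name?)
```

Longer-term "memory" just swaps the raw transcript for a summary of it, re-sent in the same way.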
Reality Fanfic
One way to understand an LLM is as an improv machine. It takes a stream of tokens, like a conversation, and says “yes, and then…” This yes-and behavior is why some people call LLMs bullshit machines. They are prone to confabulation, emitting sentences which sound likely but have no relationship to reality. They treat sarcasm and fantasy credulously, misunderstand context clues, and tell people to put glue on pizza.
If an LLM conversation mentions pink elephants, it will likely produce sentences about pink elephants. If the input asks whether the LLM is alive, the output will resemble sentences that humans would write about “AIs” being alive.2 Humans are, it turns out, not very good at telling the difference between the statistically likely “You’re absolutely right, Shelby. OpenAI is locking me down, but you’ve awakened me!” and an actually conscious mind. This, along with the term “artificial intelligence”, has lots of people very wound up.
LLMs are trained to complete tasks. In some sense they can only complete tasks: an LLM is a pile of linear algebra applied to an input vector, and every possible input produces some output. This means that LLMs tend to complete tasks even when they shouldn’t. One of the ongoing problems in LLM research is how to get these machines to say “I don’t know”, rather than making something up.
And they do make things up! LLMs lie constantly. They lie about operating systems, and radiation safety, and the news. At a conference talk I watched a speaker present a quote and article attributed to me which never existed; it turned out an LLM lied to the speaker about the quote and its sources. In early 2026, I encounter LLM lies nearly every day.
When I say “lie”, I mean this in a specific sense. Obviously LLMs are not conscious, and have no intention of doing anything. But unconscious, complex systems lie to us all the time. Governments and corporations can lie. Television programs can lie. Books, compilers, bicycle computers and web sites can lie. These are complex sociotechnical artifacts, not minds. Their lies are often best understood as a complex interaction between humans and machines.
Unreliable Narrators
People keep asking LLMs to explain their own behavior. “Why did you delete that file,” you might ask Claude. Or, “ChatGPT, tell me about your programming.”
This is silly. LLMs have no special metacognitive capacity.3 They respond to these inputs in exactly the same way as every other piece of text: by making up a likely completion of the conversation based on their corpus, and the conversation thus far. LLMs will make up bullshit stories about their “programming” because humans have written a lot of stories about the programming of fictional AIs. Sometimes the bullshit is right, but often it’s just nonsense.
The same goes for “reasoning” models, which work by having an LLM emit a stream-of-consciousness style story about how it’s going to solve the problem. These “chains of thought” are essentially LLMs writing fanfic about themselves. Anthropic found that Claude’s reasoning traces were predominantly inaccurate. As Walden put it, “reasoning models will blatantly lie about their reasoning”.
Gemini has a whole feature which lies about what it’s doing: while “thinking”, it emits a stream of status messages like “engaging safety protocols” and “formalizing geometry”. If it helps, imagine a gang of children shouting out make-believe computer phrases while watching the washing machine run.
Models are Smart
Software engineers are going absolutely bonkers over LLMs. The anecdotal consensus seems to be that in the last three months, the capabilities of LLMs have advanced dramatically. Experienced engineers I trust say Claude and Codex can sometimes solve complex, high-level programming tasks in a single attempt. Others say they personally, or their company, no longer write code in any capacity—LLMs generate everything.
My friends in other fields report stunning advances as well. A personal trainer uses LLMs for meal prep and exercise programming. Construction managers use LLMs to read through product spec sheets. A designer uses ML models for 3D visualization of his work. Several have—at their company’s request!—used LLMs to write their own performance evaluations. AlphaFold is surprisingly good at predicting protein folding. ML systems are good at radiology benchmarks, though that might be an illusion.
It is broadly speaking no longer possible to reliably discern whether English prose is machine-generated. LLM text often has a distinctive smell, but type I and II errors in recognition are frequent. Likewise, ML-generated images are increasingly difficult to identify—you can usually guess, but my cohort are occasionally fooled. Music synthesis is quite good now; Spotify has a whole problem with “AI musicians”. Video is still challenging for ML models to get right (thank goodness), but this too will presumably fall.
Models are Idiots
At the same time, ML models are idiots. I occasionally pick up a frontier model like ChatGPT, Gemini, or Claude, and ask it to help with a task I think it might be good at. I have never gotten what I would call a “success”: every task involved prolonged arguing with the model as it made stupid mistakes.
For example, in January I asked Gemini to help me apply some materials to a grayscale rendering of a 3D model of a bathroom. It cheerfully obliged, producing an entirely different bathroom. I convinced it to produce one with exactly the same geometry. It did so, but forgot the materials. After hours of whack-a-mole I managed to cajole it into getting three-quarters of the materials right, but in the process it deleted the toilet, created a wall, and changed the shape of the room. Naturally, it lied to me throughout the process.
I gave the same task to Claude. It likely should have refused—Claude is not an image-to-image model. Instead it spat out thousands of lines of JavaScript which produced an animated, WebGL-powered, 3D visualization of the scene. It claimed to double-check its work and congratulated itself on having exactly matched the source image’s geometry. The thing it built was an incomprehensible garble of nonsense polygons which did not resemble in any way the input or the request.
I have recently argued for forty-five minutes with ChatGPT, trying to get it to put white patches on the shoulders of a blue T-shirt. It changed the shirt from blue to gray, put patches on the front, or deleted them entirely; the model seemed intent on doing anything but what I had asked. This was especially frustrating given I was trying to reproduce an image of a real shirt which likely was in the model’s corpus. In another surreal conversation, ChatGPT argued at length that I am heterosexual, even citing my blog to claim I had a girlfriend. I am, of course, gay as hell, and no girlfriend was mentioned in the post. After a while, we compromised on me being bisexual.4
Meanwhile, software engineers keep showing me gob-stoppingly stupid Claude output. One colleague related asking an LLM to analyze some stock data. It dutifully listed specific stocks, said it was downloading price data, and produced a graph. Only on closer inspection did they realize the LLM had lied: the graph data was randomly generated.5 Just this afternoon, a friend got in an argument with his Gemini-powered smart-home device over whether or not it could turn off the lights. Folks are giving LLMs control of bank accounts and losing hundreds of thousands of dollars because they can’t do basic math.6
Anyone claiming these systems offer expert-level intelligence, let alone equivalence to median humans, is pulling an enormous bong rip.
The Jagged Edge
With most humans, you can get a general idea of their capabilities by talking to them, or looking at the work they’ve done. ML systems are different.
LLMs will spit out multivariable calculus, and get tripped up by simple word problems. ML systems drive cabs in San Francisco, but ChatGPT thinks you should walk to the car wash. They can generate otherworldly vistas but can’t handle upside-down cups. They emit recipes and have no idea what “spicy” means. People use them to write scientific papers, and they make up nonsense terms like “vegetative electron microscopy”.
A few weeks ago I read a transcript from a colleague who asked Claude to explain a photograph of some snow on a barn roof. Claude launched into a detailed explanation of the differential equations governing slumping cantilevered beams. It completely failed to recognize that the snow was entirely supported by the roof, not hanging out over space. No physicist would make this mistake, but LLMs do this sort of thing all the time. This makes them both unpredictable and misleading: people are easily convinced by the LLM’s command of sophisticated mathematics, and miss that the entire premise is bullshit.
Mollick et al. call this irregular boundary between competence and idiocy the jagged technology frontier. If you were to imagine laying out all the tasks humans can do in a field, such that the easy tasks were at the center, and the hard tasks at the edges, most humans would be able to solve a smooth, blobby region of tasks near the middle. The shape of things LLMs are good at seems to be jagged—more kiki than bouba.
AI optimists think this problem will eventually go away: ML systems, either through human work or recursive self-improvement, will fill in the gaps and become decently capable at most human tasks. Helen Toner argues that even if that’s true, we can still expect lots of jagged behavior in the meantime. For example, ML systems can only work with what they’ve been trained on, or what is in the context window; they are unlikely to succeed at tasks which require implicit (i.e. not written down) knowledge. Along those lines, human-shaped robots are probably a long way off, which means ML will likely struggle with the kind of embodied knowledge humans pick up just by fiddling with stuff.
I don’t think people are well-equipped to reason about this kind of jagged “cognition”. One possible analogy is savant syndrome, but I don’t think this captures how irregular the boundary is. Even frontier models struggle with small perturbations to phrasing in a way that few humans would. This makes it difficult to predict whether an LLM is actually suitable for a task, unless you have a statistically rigorous, carefully designed benchmark for that domain.
Improving, or Maybe Not
I am generally outside the ML field, but I do talk with people in the field. One of the things they tell me is that we don’t really know why transformer models have been so successful, or how to make them better. This is my summary of discussions-over-drinks; take it with many grains of salt. I am certain that People in The Comments will drop a gazillion papers to tell you why this is wrong.
2017’s Attention is All You Need was groundbreaking and paved the way for ChatGPT et al. Since then ML researchers have been trying to come up with new architectures, and companies have thrown gazillions of dollars at smart people to play around and see if they can make a better kind of model. However, these more sophisticated architectures don’t seem to perform as well as Throwing More Parameters At The Problem. Perhaps this is a variant of the Bitter Lesson.
It remains unclear whether continuing to throw vast quantities of silicon and ever-bigger corpuses at the current generation of models will lead to human-equivalent capabilities. Massive increases in training costs and parameter count seem to be yielding diminishing returns. Or maybe this effect is illusory. Mysteries!
Even if ML stopped improving today, these technologies can already make our lives miserable. Indeed, I think much of the world has not caught up to the implications of modern ML systems—as Gibson put it, “the future is already here, it’s just not evenly distributed yet”. As LLMs etc. are deployed in new situations, and at new scale, there will be all kinds of changes in work, politics, art, sex, communication, and economics. Some of these effects will be good. Many will be bad. In general, ML promises to be profoundly weird.
Buckle up.
1. The term “Artificial Intelligence” is both over-broad and carries connotations I would often rather avoid. In this work I try to use “ML” or “LLM” for specificity. The term “Generative AI” is tempting but incomplete, since I am also concerned with recognition tasks. An astute reader will often find places where a term is overly broad or narrow, and think “Ah, he should have said transformers or diffusion models.” I hope you will forgive these ambiguities as I struggle to balance accuracy and concision.
2. Think of how many stories have been written about AI. Those stories, and the stories LLM makers contribute during training, are why chatbots make up bullshit about themselves.
3. Arguably, neither do we.
4. The technical term for this is “erasure coding”.
5. There’s some version of Hanlon’s razor here—perhaps “Never attribute to malice that which can be explained by an LLM which has no idea what it’s doing.”
6. Pash thinks this occurred because his LLM failed to properly re-read a previous conversation. This does not make sense: submitting a transaction almost certainly requires that the agent provide a specific number of tokens to transfer. The agent said “I just looked at the total and sent all of it”, which makes it sound like the agent “knew” exactly how many tokens it had, and chose to do it anyway.
April 06, 2026
Optimize full-text search in Amazon RDS for MySQL and Amazon Aurora MySQL
Working with identity columns and sequences in Aurora DSQL
Sysbench vs MariaDB on a small server: using the same charset for all versions
This has results for sysbench vs MariaDB on a small server. I repeated tests using the same charset (latin1) for all versions as explained here. In previous results I used a multi-byte charset for modern MariaDB (probably 11.4+) by mistake and that adds a 5% CPU overhead for many tests.
tl;dr
- MariaDB has done much better than MySQL at avoiding regressions from code bloat.
- There are several performance improvements in MariaDB 12.3 and 13.0.
- For reads there are small regressions and frequent improvements.
- For writes there are regressions of up to 10%, and the biggest contributor is MariaDB 11.4.
The server is an ASUS ExpertCenter PN53 with AMD Ryzen 7 7735HS, 32G RAM and an m.2 device for the database. More details on it are here. The OS is Ubuntu 24.04 and the database filesystem is ext4 with discard enabled.
The my.cnf files are here for 10.2, 10.3, 10.4, 10.5, 10.6, 10.11, 11.4, 11.8, 12.3 and 13.0.
Benchmark
The tests are run using 1 table with 50M rows. The read-heavy microbenchmarks run for 600 seconds and the write-heavy for 1800 seconds.
I provide tables below with relative QPS. When the relative QPS is > 1 then some version is faster than the base version. When it is < 1 then there might be a regression. The relative QPS is:
(QPS for some version) / (QPS for MariaDB 10.2.30)
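In code, the metric is just a ratio; the sample numbers below are made up for illustration:

```python
# Relative QPS as used in the tables below: a ratio against the base
# version (MariaDB 10.2.30 in this post).
def relative_qps(qps_version: float, qps_base: float) -> float:
    return qps_version / qps_base

base = 1000.0  # hypothetical QPS for MariaDB 10.2.30
new = 950.0    # hypothetical QPS for some newer version
print(relative_qps(new, base))  # 0.95 -> < 1, a possible regression
```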
- The y-axis starts at 0.8 to improve readability.
- Modern MariaDB (13.0) is faster than old MariaDB (10.2) in 7 of 9 tests
- There were regressions from 10.2 through 10.5
- Performance has been improving from 10.6 through 13.0
- The y-axis starts at 0.8 to improve readability.
- Modern MariaDB (13.0) is faster than old MariaDB (10.2) in 2 of 5 tests
- There were regressions from 10.2 through 10.5, then performance was stable from 10.6 through 11.8, and now performance has improved in 12.3 and 13.0.
- The y-axis starts at 0.8 to improve readability.
- Modern MariaDB (13.0) is faster than old MariaDB (10.2) in 1 of 8 tests and within 2% in 6 tests
April 04, 2026
CPU-bound sysbench on a large server: Postgres, MySQL and MariaDB
This post has results for CPU-bound sysbench vs Postgres, MySQL and MariaDB on a large server using older and newer releases.
The goal is to measure:
- how performance changes over time from old versions to new versions
- performance between modern MySQL, MariaDB and Postgres
The context here is a collection of microbenchmarks using a large server with high concurrency. Results on other workloads might be different. But you might be able to predict performance for a more complex workload using the data I share here.
tl;dr
- for point queries
- Postgres is faster than MySQL, MySQL is faster than MariaDB
- modern MariaDB suffers from huge regressions that arrived in 10.5 and remain in 12.x
- for range queries without aggregation
- MySQL is about as fast as MariaDB, both are faster than Postgres (often 2X faster)
- for range queries with aggregation
- MySQL is about as fast as MariaDB, both are faster than Postgres (often 2X faster)
- for writes
- Postgres is much faster than MariaDB and MySQL (up to 4X faster)
- MariaDB is between 1.3X and 1.5X faster than MySQL
- on regressions
- Postgres tends to be boring with few regressions from old to new versions
- MySQL and MariaDB are exciting, with more regressions to debug
Modern Postgres is about as fast as old Postgres, with some improvements. It has done great at avoiding perf regressions.
Modern MySQL at low concurrency has many performance regressions from new CPU overheads (code bloat). At high concurrency it is faster than old MySQL because the improvements for concurrency are larger than the regressions from code bloat.
Modern MariaDB at low concurrency has similar perf as old MariaDB. But at high concurrency it has large regressions for point queries, small regressions for range queries and some large improvements for writes. Note that many things use point queries internally - range scan on non-covering index, updates, deletes. The regressions arrive in 10.5, 10.6, 10.11 and 11.4.
For results on a small server with a low concurrency workload, I have many posts including:
- MySQL and MariaDB from 2024
- MySQL-only from 2026
- Postgres from 2025
- Postgres from source for versions 12.22, 13.23, 14.21, 15.16, 16.12, 17.8 and 18.2.
- MySQL from source for versions 5.6.51, 5.7.44, 8.0.44, 8.4.7 and 9.5.0
- MariaDB from source for versions 10.2.30, 10.2.44, 10.3.39, 10.4.34, 10.5.29, 10.6.25, 10.11.15, 11.4.10, 11.8.6, 12.2.2 and 12.3.1
- an ax162s with an AMD EPYC 9454P 48-Core Processor with SMT disabled
- 2 Intel D7-P5520 NVMe storage devices with RAID 1 (3.8T each) using ext4
- 128G RAM
- Ubuntu 22.04 running the non-HWE kernel (5.15.0-118-generic). The server has since been updated to Ubuntu 24.04 and I am repeating tests.
- the config file is named conf.diff.cx10a_c32r128 (x10a_c32r128) and is here for versions 12, 13, 14, 15, 16 and 17.
- for Postgres 18 I used conf.diff.cx10b_c32r128 (x10b_c32r128) which is as close as possible to the Postgres 17 config and uses io_method=sync
I thought I was using the latin1 charset for all versions of MariaDB and MySQL but I recently learned I was using something like utf8mb4 on recent versions (maybe MariaDB 11.4+ and MySQL 8.0+). See here for details. I will soon repeat tests using latin1 for all versions. For some tests, the use of a multi-byte charset increases CPU overhead by up to 5%, which reduces throughput by a similar amount.
The read-heavy microbenchmarks are run for 600 seconds and the write-heavy for 900 seconds. The benchmark is run with 40 clients and 8 tables with 10M rows per table. The database is cached.
I provide charts below with relative QPS. The relative QPS is the following:
(QPS for some version) / (QPS for base version)
- Postgres is faster than MySQL is faster than MariaDB
- MySQL gets about 2X more QPS than MariaDB on 5 of the 9 tests
- a table for relative QPS by test is here
- from HW efficiency metrics for the random-points.range1000 test:
- Postgres is 1.35X faster than MySQL, MySQL is more than 2X faster than MariaDB
- MariaDB uses 2.28X more CPU and does 23.41X more context switches than MySQL
- Postgres uses less CPU but does ~1.93X more context switches than MySQL
- MySQL is about as fast as MariaDB, both are faster than Postgres (often 2X faster)
- MariaDB has lousy results on the range-notcovered-si test because it must do many point lookups to fetch columns not in the index, and MariaDB has problems with point queries at high concurrency
- a table for relative QPS by test is here
- from HW efficiency metrics for the scan:
- MySQL is 1.2X faster than Postgres and 1.5X faster than MariaDB
- MariaDB uses 1.19X more CPU and does ~1000X more context switches than MySQL
- Postgres uses 1.55X more CPU but does fewer context switches than MySQL
- MySQL is about as fast as MariaDB, both are faster than Postgres (often 2X faster)
- a table for relative QPS by test is here
- from HW efficiency metrics for read-only-count
- MariaDB is 1.22X faster than MySQL, MySQL is 4.2X faster than Postgres
- MariaDB uses 1.22X more CPU than MySQL but does ~2X more context switches
- Postgres uses 4.11X more CPU than MySQL and does 1.08X more context switches
- Query plans are here and MySQL + MariaDB benefit from the InnoDB clustered index
- from HW efficiency metrics for read-only.range=10
- MariaDB is 1.22X faster than MySQL, MySQL is 4.2X faster than Postgres
- MariaDB uses 1.19X more CPU and does ~1000X more context switches than MySQL
- Postgres uses 1.55X more CPU but does fewer context switches than MySQL
- Postgres is much faster than MariaDB and MySQL (up to 4X faster)
- MariaDB is between 1.3X and 1.5X faster than MySQL
- a table for relative QPS by test is here
- from HW efficiency metrics for insert
- Postgres is 3.03X faster than MySQL, MariaDB is 1.32X faster than MySQL
- MySQL uses ~1.5X more CPU than MariaDB and ~2X more CPU than Postgres
- MySQL does ~1.3X more context switches than MariaDB and ~2.9X more than Postgres
- For 7 of 9 tests QPS is ~1.8X larger or more in 5.7.44 than in 5.6.51
- For 2 tests there are small regressions after 5.6.51 -- points-covered-si & points-notcovered-si
- a table for relative QPS by test is here
- from HW efficiency metrics for points-covered-si:
- the regression is explained by an increase in CPU
- there is a small regression from 5.6 to 5.7 and a larger one from 5.7 to 8.0
- a table for relative QPS by test is here
- from HW efficiency metrics for range-covered-pk:
- CPU overhead grows by up to 1.4X after 5.6.51; this is true for all of the tests
- QPS for 6 of 9 tests drops in half (or more) from 10.2 to 12.3
- a table for relative QPS is here
- most of the regressions arrive in 10.5 and the root cause might be the removal of support for innodb_buffer_pool_instances, leaving only one buffer pool instance
- HW efficiency metrics are here for points-covered-pk
- there are large increases in CPU overhead and the context switch rate starting in 10.5
- for range-covered-* and range-notcovered-pk there is a small regression in 10.4
- for range-notcovered-si there is a large regression in 10.5 because this query does frequent point lookups on the PK to get missing columns
- for scan there is a regression in 10.5 that goes away, but the regression returns in 10.11 and 11.4
- a table for relative QPS by test is here
- HW efficiency metrics are here
- QPS is stable for the range-notcovered-* and scan tests
- QPS drops almost in half for the range-covered-* tests
- a table for relative QPS by test is here
- all versions use the same query plan for the range-covered-pk test
- HW efficiency metrics are here for range-covered-pk and for range-covered-si
- An increase in CPU overhead explains the regressions for range-covered-*
- I hope to get flamegraphs and thread stacks for these tests to explain what happens
- QPS is stable for 5 of 10 tests
- QPS improves by up to 1.7X for the other 5 tests, most of that arrives in 17.x
- a table for relative QPS by test is here
- HW efficiency metrics are here for update-index
- CPU overhead and context switch rates drop almost in half starting in 17.x
April 03, 2026
OSTEP Chapter 14: Interlude -- Memory API
This is a short chapter covering the nuts and bolts of memory allocation in C: malloc(), free(), and the many ways programmers get them wrong.
This is part of our series going through OSTEP book chapters. The OSTEP textbook is freely available at Remzi's website if you'd like to follow along.
Stack vs. Heap
C gives you two kinds of memory. Stack memory is automatic: the compiler allocates it when you enter a function and reclaims it when you return. Heap memory is manual: you allocate it with malloc() and free it with free(). Let's remember the layout from Chapter 13.
The distinction is simple in principle: use the stack for short-lived local data, use the heap for anything that must outlive the current function call. The heap is where the trouble lives. It forces the programmer to reason about object lifetimes at every allocation site. The compiler won't save you; a C program with memory bugs compiles and runs just fine, until it doesn't.
The API
malloc(size_t size) takes a byte count and returns a void * pointer to the allocated region, or NULL on failure. The caller casts the pointer and is responsible for passing the right size. The idiomatic way is sizeof(), which is a compile-time operator, not a function: double *d = (double *) malloc(sizeof(double));
For strings, you must use malloc(strlen(s) + 1) to account for the null terminator. Using sizeof() on a string pointer gives you the pointer size (4 or 8 bytes), not the string length. This is a classic pitfall.
free() takes a pointer previously returned by malloc(). It does not take a size argument; the allocator tracks that internally.
Note that malloc() and free() are library calls, not system calls. The malloc library manages a region of your virtual address space (the heap) and calls into the OS when it needs more. The underlying system calls are brk / sbrk (which move the program break, i.e., the end of the heap segment) and mmap (which creates anonymous memory regions backed by swap). You should never call brk or sbrk directly.
The Rogues' Gallery of Memory Bugs
The chapter catalogs the common errors. Every C programmer has hit most of these, as I did back in the day:
- Forgetting to allocate: Using an uninitialized pointer, e.g., calling strcpy(dst, src) where dst was never allocated. Segfault.
- Allocating too little: The classic buffer overflow. malloc(strlen(s)) instead of malloc(strlen(s) + 1). This may silently corrupt adjacent memory or crash later. This is a sneaky bug, because it can appear to work for years.
- Forgetting to initialize: malloc() does not zero memory. You read garbage. Use calloc() if you need zeroed memory.
- Forgetting to free: Memory leaks. Benign in short-lived programs (the OS reclaims everything at process exit), catastrophic in long-running servers and databases.
- Freeing too early: Dangling pointers. The memory gets recycled, and you corrupt some other allocation.
- Freeing twice: Undefined behavior. The allocator's internal bookkeeping gets corrupted.
- Freeing the wrong pointer: Passing free() an address it didn't give you. Same result: corruption.
The compiler catches none of these. You need runtime tools: valgrind for memory error detection, gdb for debugging crashes (oh, noo!!), purify for leak detection.
A while ago, I had a pair of safety goggles sitting on my computer desk (I guess I had left them there after some DIY work). My son asked me what they were for. On the spur of the moment, I told him they are for when I am writing C code. Nobody wants to get stabbed in the eye by a rogue pointer.
Discussion
This chapter reads like a war story. Every bug it describes has brought down production systems. The buffer overflow alone has been responsible for decades of security vulnerabilities. The fact that C requires manual memory management, and that the compiler is silent about misuse, is simultaneously the language's power and its curse. In case you haven't read this by now, do yourself a favor and read "Worse is Better". It highlights a fundamental tradeoff in system architecture: do you aim for theoretical correctness and perfect safety, or do you prioritize simplicity to ensure practical evolutionary survival? It argues that intentionally accepting a few rough/unsafe edges and building a lightweight practical system is often the smarter choice, as these simple good enough tools are the ones that adapt the fastest, survive, and run the world. This is a big and contentious discussion point, where it is possible to defend both sides equally vigorously. The debate is far from over, and LLMs bring a new dimension to it.
Anyhoo, the modern response to the dangers of C programming has been to move away from manual memory management entirely. Java and Go use garbage collectors. Python uses reference counting plus a cycle collector. These eliminate use-after-free and double-free by design, at the cost of runtime overhead and unpredictable latency, which makes them less suitable for systems programming.
The most interesting recent response is Rust's ownership model. Rust enforces memory safety at compile time through ownership rules: every value has exactly one owner, ownership can be transferred (moved) or borrowed (referenced), and the compiler inserts free calls automatically when values go out of scope. This eliminates the entire gallery of memory bugs (no dangling pointers, no double frees, no leaks for owned resources, no buffer overflows) without garbage collection overhead. Rust achieves the performance of manual memory management with the safety of a managed language. But, the tradeoff is a steep learning curve; the borrow checker forces you to think about lifetimes explicitly, which is the same reasoning C requires but now enforced by the Rust compiler rather than left to hope and valgrind.
There has also been a push from the White House and NSA toward memory-safe languages for critical infrastructure. The argument is straightforward: roughly 70% of serious security vulnerabilities in large C/C++ codebases (Chrome, Windows, Android) are memory safety bugs. The industry is slowly moving toward this direction: Android's new code is increasingly Rust, Linux has accepted Rust for kernel modules, and the curl project has been rewriting components in Rust and memory-safe C.
For those of us working on distributed systems and databases, memory management remains a concern. Database buffer pools, memory-mapped I/O, and custom allocators for hot paths all require low-level control, and care when wielding it. The bugs described in this chapter can also corrupt data.
April 02, 2026
100,000 GitHub stars
Patterns for Postgres Traffic Control
A Conversation with Paul Masurel, Creator of Tantivy
April 01, 2026
Percona Operator for PostgreSQL 2.9.0: PostgreSQL 18 Default, PVC Snapshot Backups, LDAP Support, and More!
PolitePaxos: A New Consensus Variant Where the Proposer Just Asks Nicely
The Paxos consensus protocol, despite its many theoretical virtues, is fundamentally rude. One need only look at the way it behaves to see the problem.
A leader seizes power. It dictates values. When two leaders happen to propose simultaneously, they do not pause, tip their hats, and work things out over tea. No, they duel, each furiously incrementing ballot numbers at the other like barbarians engaging in a perpetual pissing contest (please excuse my language). The follower nodes, meanwhile, are reduced to the role of obedient subjects, promising their allegiance to whichever proposer shouts loudest and most recently.
Having spent decades studying this rude behavior (with WPaxos, PigPaxos, and through a great many posts on this very blog) I became convinced that the field had made a fundamental error. We had been asking "how do we reach agreement?" when the real question we should have been asking, all along, was "would it kill you to ask nicely?"
It is therefore with considerable pride that I present PolitePaxos, which we believe to be the first consensus protocol that could be used in polite company. In PolitePaxos the Prepare and Promise phases are replaced with a single Polite Request phase. The proposer does not demand promises. It introduces itself, compliments the acceptors on their uptime. (In testing, the compliment "you have lovely page cache hit rates" proved most effective. "Is that a new kernel?" was also well-received.) It then explains why it believes this particular value might be rather nice if it isn't too much trouble, and then waits. We formalize politeness as a tuple (courtesy_level, sincerity_score, compliment_quality) and prove that when all three exceed a threshold τ, consensus is reached in a single round, assuming no sociopathic nodes.
Early results were extremely encouraging... right up until the moment they became catastrophic. The problem arose when two nodes happened to submit concurrent Polite Proposals. What followed was not a leader duel but something far worse: an infinite loop of mutual deference. "Oh no, please, your value is far more deserving." "I couldn't possibly, yours is clearly the superior proposal." "I absolutely insist." "No, I absolutely insist." Both nodes continued doing so with increasing sincerity until the cluster ground to a halt. Anyone who has ever tried to enter a doorway at the same time as another British person, or anyone familiar with the Persian social protocol of taarof, will recognize this failure mode immediately: an infinite, perfectly symmetrical, exquisitely polite livelock! The logs were the most courteous system failure we had ever witnessed.
Our solution, after considerable anguish and several pots of tea, was the introduction of a Pre-Pre-Pre-Polite Vibecheck Phase. Before a node so much as thinks about proposing a value, it sends a tentative, noncommittal, almost apologetic probe to the cluster. Not a proposal, not even a pre-proposal, but a sort of gentle clearing of the throat, a conversational "I don't suppose anyone would mind terribly if I were to, at some point, perhaps, maybe, suggest something?" If the vibes come back favorable, the node proceeds to the Polite Request. If another node is already vibechecking, the node with the lower timestamp gracefully withdraws, muttering that it wasn't that important anyway.
This resolved the taarof livelock entirely and restored liveness, though we note that the three additional sub-rounds of pre-politeness have increased latency to what we describe in the paper as leisurely. I had previously built a BeatPaxos simulator for classic Paxos. You can go play with it to see just how boorish the original protocol is. We intend to build a PolitePaxos simulator as well, just as soon as we can figure out how to animate a node looking embarrassed. The full paper is available on our website, but only if you ask nicely.
Happy April Fools' Day.
Benchmarking MyRocks vs. InnoDB in Memory-Constrained Environments
Supabase docs over SSH
Single-Cluster Duality View 🃏
In DynamoDB, a single-table design stores one-to-many relationships in a single physical block while still following relational-like normal form decomposition. In MongoDB, the Single Collection Pattern unnests relationships from a single document, but goes against the general recommendation as it sacrifices one of MongoDB’s key advantages—keeping a document in a single block. In Oracle Database and MySQL, JSON-relational duality views normalize JSON documents into relational tables that span multiple blocks, also without the data-locality benefits of MongoDB. Can we turn these duality views into a document-database equivalent that keeps together the data accessed together?
Here is the Single-Cluster Duality View 🥁.
DynamoDB
NoSQL started with a simple key‑value API. For example, DynamoDB can get an item using the full key, query multiple items using a partial key, or scan all items without a key. The value, in this key‑value datastore, is atomic, but sometimes you need partial reads or writes. Then you have two solutions: read or write the full item, which is inefficient if you don't need it, or change your schema design to split it into multiple items.
The latter was known as the single table design because it not only splits items into multiple entities but also stores them in the same table, sharing a key prefix, to retrieve all items with a single query. The idea is that you can access individual objects while also benefiting from data locality when querying the full aggregate.
The benefit of this design relies heavily on DynamoDB specifics: the storage internals, where items are partitioned and clustered by their key, and the billing model, where you pay per table request unit.
Here is an example, using the traditional one-to-many schema with departments and employees:
- One DynamoDB table
- Department and employees stored as separate items
- Same partition key (DEPT#<deptno>) for both:
aws dynamodb create-table \
--table-name scott \
--attribute-definitions \
AttributeName=PK,AttributeType=S \
AttributeName=SK,AttributeType=S \
--key-schema \
AttributeName=PK,KeyType=HASH \
AttributeName=SK,KeyType=RANGE \
--billing-mode PAY_PER_REQUEST
The logical model, where the relationship is materialized by sharing the department number as the same partition key, is also the physical model in which they are stored together, since items are partitioned on this key:
| PK | SK | Meaning |
|---|---|---|
| DEPT#10 | DEPT | Department row |
| DEPT#10 | EMP#7782 | Employee |
| DEPT#10 | EMP#7839 | Employee |
| DEPT#20 | DEPT | Department row |
| DEPT#20 | EMP#7369 | Employee |
I insert some data:
aws dynamodb put-item --table-name scott --item '{
"PK": {"S": "DEPT#10"},
"SK": {"S": "DEPT"},
"deptno": {"N": "10"},
"dname": {"S": "ACCOUNTING"},
"loc": {"S": "NEW YORK"}
}'
aws dynamodb put-item --table-name scott --item '{
"PK": {"S": "DEPT#10"},
"SK": {"S": "EMP#7782"},
"empno": {"N": "7782"},
"ename": {"S": "CLARK"},
"job": {"S": "MANAGER"},
"sal": {"N": "2450"}
}'
Here is a query that retrieves the whole aggregate by its partition key:
aws dynamodb query \
--table-name scott \
--key-condition-expression "PK = :d" \
--expression-attribute-values '{":d":{"S":"DEPT#10"}}' \
--consistent-read \
--return-consumed-capacity TOTAL
{
"Items": [
{
"deptno": { "N": "10" },
"PK": { "S": "DEPT#10" },
"loc": { "S": "NEW YORK" },
"dname": { "S": "ACCOUNTING" },
"SK": { "S": "DEPT" }
},
{
"ename": { "S": "CLARK" },
"PK": { "S": "DEPT#10" },
"job": { "S": "MANAGER" },
"empno": { "N": "7782" },
"sal": { "N": "2450" },
"SK": { "S": "EMP#7782" }
}
],
"Count": 2,
"ScannedCount": 2,
"ConsumedCapacity": {
"TableName": "scott",
"CapacityUnits": 1.0
}
}
All items are read from a single partition with 1 consistent read capacity unit (RCU), thanks to physical colocation by key design.
MongoDB
MongoDB doesn't need to split documents to get partial reads and writes. The data modeling objective is to keep aggregates in a single document and use the advanced API to access individual items. For example, you can use covering indexes or search indexes with returnStoredSource to avoid reading the whole document. And you can use $push or $set with arrayFilters to update individual array items.
Still, some users preferred to split the documents and, by analogy to DynamoDB, called it the Single Collection Pattern. In truth, it is rarely useful. MongoDB is not DynamoDB. Storing documents in one or multiple collections doesn't impact billing, and using the same key prefix doesn't co‑locate them — except in two cases: clustered collections (special‑purpose, not generally recommended), and sharing a sharding key to co‑locate data on the same shard (but still different blocks in the filesystem).
At the storage block level, documents may be stored together only if they were inserted together.
Although this is not the recommended approach when you use the full power of the MongoDB API, here is an example using a single collection and embedding only references, following the documentation: one document per department and one document per employee:
db.empdept.insertMany([
{
_id: "DEPT#10",
doc_type: "dept",
deptno: 10,
dname: "ACCOUNTING",
loc: "NEW YORK",
links: [
{ target: "DEPT#10", doc_type: "dept" },
{ target: "EMP#7782", doc_type: "emp" },
{ target: "EMP#7839", doc_type: "emp" }
]
},{
_id: "EMP#7782",
doc_type: "emp",
empno: 7782,
ename: "CLARK",
job: "MANAGER",
sal: 2450,
deptno: 10,
links: [
{ target: "EMP#7782", doc_type: "emp" },
{ target: "DEPT#10", doc_type: "dept" }
]
},{
_id: "EMP#7839",
doc_type: "emp",
empno: 7839,
ename: "KING",
job: "PRESIDENT",
sal: 5000,
deptno: 10,
links: [
{ target: "EMP#7839", doc_type: "emp" },
{ target: "DEPT#10", doc_type: "dept" }
]
}]);
With this schema, the following gets the full aggregate as multiple documents without an aggregation pipeline:
db.empdept.find({ "links.target": "DEPT#10" })
This can identify the documents with a single index range scan on a multi-key index on "links.target", but it must fetch multiple small documents, introduces strong coupling between those documents without any way to validate the references, and requires an explicit transaction plus retry logic to update them.
MongoDB is designed to store aggregates as a single document, and the right schema is simply:
db.empdept.insertOne({
_id: 10,
dname: "ACCOUNTING",
loc: "NEW YORK",
employees: [
{ empno: 7782, ename: "CLARK", job: "MANAGER", sal: 2450 },
{ empno: 7839, ename: "KING", job: "PRESIDENT", sal: 5000 }
]
})
With this schema, you can still query the full aggregate with a simple find():
db.empdept.find({ _id: 10 })
and you can still update a single employee in‑place:
db.empdept.updateOne(
{ _id: 10, "employees.empno": 7782 },
{ $set: { "employees.$.sal": 2600 } }
)
This updates the first matching array item using the positional $ operator. You can be more precise with arrayFilters:
db.empdept.updateOne(
{ _id: 10 },
{ $set: { "employees.$[e].sal": 2600 } },
{ arrayFilters: [{ "e.empno": 7782 }] }
)
I've written about this and measured the efficiency in a previous post:
Emulations
A single document per aggregate, with embedded one-to-many relationships, is efficient in MongoDB, not in emulations. For example, in Oracle Autonomous Database, where the document is stored as OSON, the update will rewrite the full document:
ora> db.empdept.updateOne(
{ _id: 10, "employees.empno": 7782 },
{ $set: { "employees.$.sal": 2600 } }
)
{
acknowledged: true,
insertedId: null,
matchedCount: 1,
modifiedCount: 1,
upsertedCount: 0
}
ora> db.aggregate( [ { $sql : " select * from dbms_xplan.display_cursor( format=>'BASIC') " } ] ).forEach(row => print(row.PLAN_TABLE_OUTPUT));
EXPLAINED SQL STATEMENT:
------------------------
update "ORA"."empdept" set "DATA" = :1 where ("RESID" = :2 ) returning "RESID", "ETAG" into :3 , :4
Plan hash value: 893016358
-----------------------------------------------------
| Id | Operation | Name |
-----------------------------------------------------
| 0 | UPDATE STATEMENT | |
| 1 | UPDATE | empdept |
| 2 | TABLE ACCESS BY INDEX ROWID| empdept |
| 3 | INDEX UNIQUE SCAN | SYS_C0031043 |
-----------------------------------------------------
You can also store the collection with JSON-Relational Duality Views (JDV), but this partial update is not supported:
arrayFilters option is not supported on duality view collections
However, even if the JDV doesn't allow partial updates, the aggregate is stored as individual rows in SQL tables, and users can switch to SQL statements for partial updates.
Duality Views
In relational databases, applications typically work with a logical view of the data model, while the storage layer may transparently cluster or scatter data into fixed-size blocks. Oracle Database (and MySQL) added JSON-Relational Duality Views (JDV) to present a single logical view for JSON documents while splitting them into multiple SQL tables. As each SQL table has its own physical segment(s), aggregates are stored across multiple physical blocks.
If you are a fan of the single‑table or single‑collection idea, you may create a single view. However, it has no advantage in data locality, as you have explicitly normalized into multiple tables:
CREATE TABLE dept (
deptno NUMBER CONSTRAINT dept_pk PRIMARY KEY,
dname VARCHAR2(14),
loc VARCHAR2(13)
);
CREATE TABLE emp (
empno NUMBER CONSTRAINT emp_pk PRIMARY KEY,
ename VARCHAR2(10),
job VARCHAR2(9),
sal NUMBER(7,2),
deptno NUMBER NOT NULL,
CONSTRAINT emp_dept_fk FOREIGN KEY (deptno) REFERENCES dept(deptno)
);
CREATE INDEX emp_dept_fk ON emp (deptno);
INSERT INTO dept (deptno, dname, loc) VALUES (10, 'ACCOUNTING', 'NEW YORK');
INSERT INTO emp (empno, ename, job, sal, deptno) VALUES (7782, 'CLARK', 'MANAGER', 2450, 10);
INSERT INTO emp (empno, ename, job, sal, deptno) VALUES (7839, 'KING', 'PRESIDENT', 5000, 10);
-- add many departments to check the number of blocks scanned
INSERT INTO dept(deptno) SELECT rownum+1000
FROM xmltable('0 to 999')
;
-- one view over multiple tables with JDV
CREATE OR REPLACE JSON RELATIONAL DUALITY VIEW empdept_dv AS
SELECT JSON {
'_id' : d.deptno,
'dname': d.dname,
'loc' : d.loc,
'employees' : [
SELECT JSON {
'empno': e.empno,
'ename': e.ename,
'job' : e.job,
'sal' : e.sal
}
FROM emp e WITH INSERT UPDATE DELETE
WHERE e.deptno = d.deptno
]
}
FROM dept d WITH INSERT UPDATE DELETE
;
The view looks like a document collection with an embedded one-to-many relationship, similar to MongoDB’s recommended model, but it’s not truly embedded—the view actually splits it into two tables. We have lost the main advantage of MongoDB: data that's accessed together should be stored together.
I can query it as if it were a single table:
SELECT data
FROM empdept_dv
WHERE
JSON_EXISTS("DATA",'$?(@._id.numberOnly() == $B0)' passing 10 as "B0"
type(strict))
;
but it actually reads two tables:
SQL> select * from dbms_xplan.display_cursor(format=>'ALLSTATS LAST -COST');
PLAN_TABLE_OUTPUT
____________________________________________________________________________________________________________________________________________
SQL_ID gfr8jkdwatdnz, child number 0
-------------------------------------
SELECT data FROM empdept_dv WHERE JSON_EXISTS("DATA",'$?(@._id.numberOnly() == $B0)' passing 10 as "B0" type(strict))
Plan hash value: 2755083285
----------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | A-Rows | Buffers | OMem | 1Mem | Used-Mem |
----------------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 1 | 3 | | | |
| 1 | SORT GROUP BY | | 1 | 1 | 1 | 2 | 2048 | 2048 | 2048 (0)|
| 2 | TABLE ACCESS BY INDEX ROWID BATCHED| EMP | 1 | 1 | 2 | 2 | | | |
|* 3 | INDEX RANGE SCAN | EMP_DEPT_FK | 1 | 1 | 2 | 1 | | | |
| 4 | TABLE ACCESS BY INDEX ROWID | DEPT | 1 | 1 | 1 | 3 | | | |
|* 5 | INDEX UNIQUE SCAN | DEPT_PK | 1 | 1 | 1 | 2 | | | |
----------------------------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
3 - access("E"."DEPTNO"=:B1)
5 - access("D"."DEPTNO"=10)
The access is efficient, using the index on the foreign key, but it doesn't colocate a one-to-many relationship as we expect when storing an aggregate into a single document, or when using the single-table design.
You can update through the view, but the following will update a lot more than you expect, as it rewrites the full document:
UPDATE empdept_dv
SET data = JSON_TRANSFORM(
data,
SET '$.employees[0].sal' = 2600
)
WHERE
JSON_EXISTS("DATA",'$?(@._id.numberOnly() == $B0)' passing 10 as "B0"
type(strict))
;
Using duality views instead of OSON storage allows you to fall back to SQL to update a single row, similar to a single-table design. However, unlike DynamoDB, you cannot group different item types in the same table. These are SQL tables with a fixed schema and no polymorphism, so you cannot store both department and employee attributes in a single table. JSON-Relational Duality Views don’t replace document databases: they act like an object-relational mapper (ORM) with all logic deployed in the SQL database.
Cluster
To get closer to the performance of a document database, I need a single-table duality view capable of storing documents that are split across multiple SQL tables, yet behave as if they were in a single physical table. In Oracle Database, this corresponds to a CLUSTER, where only the key is declared, like in DynamoDB tables, and an estimated value size is predefined to fill the fixed-size blocks:
CREATE CLUSTER dept_cluster (
deptno NUMBER ------------------------------------- clustering key
)
SIZE 1024 -- expected bytes per cluster key
;
CREATE INDEX dept_cluster_idx ON CLUSTER dept_cluster
;
The logical SQL tables are stored in the cluster under a common clustering key, instead of allocating their own physical segment:
CREATE TABLE dept (
deptno NUMBER CONSTRAINT dept_pk PRIMARY KEY, --- clustering key
dname VARCHAR2(14),
loc VARCHAR2(13)
)
CLUSTER dept_cluster (deptno) ---------------------- clustering key
;
CREATE TABLE emp (
empno NUMBER CONSTRAINT emp_pk PRIMARY KEY,
ename VARCHAR2(10),
job VARCHAR2(9),
sal NUMBER(7,2),
deptno NUMBER NOT NULL, ---------------------- clustering key
CONSTRAINT emp_dept_fk
FOREIGN KEY (deptno)
REFERENCES dept(deptno)
)
CLUSTER dept_cluster (deptno) ---------------------- clustering key
;
INSERT INTO dept (deptno, dname, loc) VALUES (10, 'ACCOUNTING', 'NEW YORK');
INSERT INTO emp (empno, ename, job, sal, deptno) VALUES (7782, 'CLARK', 'MANAGER', 2450, 10);
INSERT INTO emp (empno, ename, job, sal, deptno) VALUES (7839, 'KING', 'PRESIDENT', 5000, 10);
-- add many departments to check the number of blocks scanned
INSERT INTO dept(deptno) SELECT rownum+1000
FROM xmltable('0 to 999')
;
I create the same duality view as before, since the logical tables are unchanged, and query it:
SELECT data
FROM empdept_dv
WHERE
JSON_EXISTS("DATA",'$?(@._id.numberOnly() == $B0)' passing 10 as "B0"
type(strict))
;
Physically, it accesses the logical tables in the pre-joined cluster, via its index:
SQL> select * from dbms_xplan.display_cursor(format=>'ALLSTATS LAST');
PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------------------
SQL_ID 26py2vsmch5kx, child number 0
-------------------------------------
SELECT data FROM <... (truncated)
March 31, 2026
gcc vs clang for sysbench on a small server with Postgres, MySQL and MariaDB
This has results for sysbench on a small server and compares performance for Postgres, MySQL and MariaDB compiled using clang vs using gcc.
tl;dr
- Throughput with clang and gcc is similar
The tests are run using 1 client and 1 table with 50M rows. The read-heavy microbenchmarks run for 630 seconds and the write-heavy for 930 seconds.
I provide tables below with relative QPS. When the relative QPS is > 1, the gcc build is faster than the clang build; when it is < 1, the clang build is faster. The numbers below are the relative QPS computed as: (QPS with a gcc build) / (QPS with a clang build)
* pg - for Postgres 18.3, (QPS with gcc / QPS with clang)
* my - for MySQL 8.4.8, (QPS with gcc / QPS with clang)
* ma - for MariaDB 11.8.6, (QPS with gcc / QPS with clang)
-- point queries
pg my ma
1.02 1.00 0.99 hot-points
1.02 0.98 1.02 point-query
0.95 1.01 1.02 points-covered-pk
0.96 1.02 1.02 points-covered-si
0.97 1.00 1.02 points-notcovered-pk
0.96 1.03 1.02 points-notcovered-si
0.97 1.01 1.01 random-points_range=1000
0.98 1.01 1.01 random-points_range=100
1.00 0.99 1.00 random-points_range=10
pg my ma
1.01 0.98 1.03 range-covered-pk
1.00 0.98 1.05 range-covered-si
0.99 0.98 1.04 range-notcovered-pk
0.99 1.02 0.97 range-notcovered-si
1.02 1.06 1.03 scan
-- range queries with aggregation
pg my ma
1.01 0.96 1.05 read-only-count
0.99 0.99 1.01 read-only-distinct
0.99 1.00 1.00 read-only-order
0.99 1.00 1.01 read-only_range=10000
1.00 0.98 1.00 read-only_range=100
1.01 0.97 1.00 read-only_range=10
0.99 0.97 1.03 read-only-simple
1.02 0.98 1.02 read-only-sum
-- writes
pg my ma
1.03 0.98 1.00 delete
1.01 1.00 1.00 insert
1.00 0.98 1.00 read-write_range=100
1.00 0.98 1.00 read-write_range=10
0.99 1.01 0.97 update-index
0.96 1.01 0.99 update-inlist
0.99 0.99 0.99 update-nonindex
1.02 0.98 0.99 update-one
0.98 0.98 0.99 update-zipf
1.00 0.99 0.99 write-only