a curated list of database news from authoritative sources

August 21, 2025

Sysbench for MySQL 5.6 thru 9.4 on a small server

This has performance results for InnoDB from MySQL 5.6.51, 5.7.44, 8.0.43, 8.4.6 and 9.4.0 on a small server with sysbench microbenchmarks. The workload here is cached by InnoDB and my focus is on regressions from new CPU overheads. This work was done by Small Datum LLC and not sponsored. 

tl;dr

  • Low concurrency (1 client) is the worst case for regressions in modern MySQL
  • MySQL 8.0, 8.4 and 9.4 are much slower than 5.6.51 in all but 2 of the 32 microbenchmarks
    • The bad news - performance regressions aren't getting fixed
    • The good news - regressions after MySQL 8.0 are small

Builds, configuration and hardware

I compiled MySQL from source for versions 5.6.51, 5.7.44, 8.0.43, 8.4.6 and 9.4.0.

The server is an ASUS ExpertCenter PN53 with AMD Ryzen 7 7735HS, 32G RAM and an m.2 device for the database. More details on it are here. The OS is Ubuntu 24.04 and the database filesystem is ext4 with discard enabled.

The my.cnf files are here.

Benchmark

I used sysbench and my usage is explained here. To save time I only run 32 of the 42 microbenchmarks and most test only 1 type of SQL statement. Benchmarks are run with the database cached by InnoDB.

The tests are run using 1 table with 50M rows. The read-heavy microbenchmarks run for 600 seconds and the write-heavy for 900 seconds.

Results

All files I saved from the benchmark are here and the spreadsheet is here.

The microbenchmarks are split into 4 groups -- 1 for point queries, 2 for range queries, and 1 for writes. For the range query microbenchmarks, part 1 has queries that don't do aggregation while part 2 has queries that do aggregation. 

I provide charts below with relative QPS, defined as:
(QPS for some version) / (QPS for MySQL 5.6.51)
When the relative QPS is > 1, that version is faster than 5.6.51; when it is < 1, there might be a regression. Values from iostat and vmstat divided by QPS are also provided here. These can help explain why something is faster or slower because they show how much hardware is used per request.
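As a sketch with made-up numbers (not results from this benchmark), the relative QPS and the per-query hardware cost can be computed like this:

```python
# Hypothetical QPS values for illustration only (not measured results).
qps_base = 12000.0   # QPS for MySQL 5.6.51, the baseline
qps_new = 8500.0     # QPS for some newer version

relative_qps = qps_new / qps_base
print(f"relative QPS: {relative_qps:.3f}")  # < 1 suggests a regression

# Dividing a vmstat/iostat counter by QPS gives hardware used per request;
# here, CPU seconds consumed per query derived from vmstat us+sy utilization.
cpu_util_pct = 62.0      # hypothetical vmstat us+sy during the run
num_cpus = 8
cpu_secs_per_query = (cpu_util_pct / 100.0) * num_cpus / qps_new
print(f"CPU seconds per query: {cpu_secs_per_query:.6f}")
```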

Results: point queries

Based on results from vmstat the regressions are from new CPU overheads.

Results: range queries without aggregation

Based on results from vmstat the regressions are from new CPU overheads.

Results: range queries with aggregation

Based on results from vmstat the regressions are from new CPU overheads.

Results: writes

Based on results from vmstat the regressions are from new CPU overheads.


MySQL Router 8.4: How to Deal with Metadata Updates Overhead

It may be surprising when a new InnoDB Cluster is set up, and despite not being in production yet and completely idle, it manifests a significant amount of writes visible in growing binary logs. This effect became much more spectacular after MySQL version 8.4. In this write-up, I will explain why it happens and how to address […]

Cabinet: Dynamically Weighted Consensus Made Fast

This paper (to appear in VLDB'25) proposes a consensus algorithm called "Cabinet", which dynamically adjusts node weights based on responsiveness.

As in the past three weeks, Aleksey and I live-recorded our first blind read and discussion of the paper to capture in real time how experts approach and dissect a paper. I'll link to the video when it's uploaded here. The paper I annotated during our read is available here. Frankly, we regretted reading this paper, as it had big flaws.

Motivation

The paper began by revisiting the role of consensus protocols in distributed systems: ensuring consistent state across replicas. It claimed that consensus often struggles with scalability because majority quorums scale poorly. But this is misleading because it omits that flexible quorums already address this issue. The paper ignores the flexible quorums work, mentioning it in only one sentence. Ironically, as we will see later, flexible quorums resurface and undermine the paper's entire premise.

Carrying on with the introduction, the paper confused us by asserting that Google Spanner uses quorums of hundreds of nodes. Wait, wut!? "For instance, in Google’s Spanner's evaluation [15], despite the system scaling from tens to hundreds of nodes, the quorum size keeps growing (e.g., 51 replies in 100 nodes), leading to low performance, especially with high latency, though the probability of half of the nodes failing in the system is exceedingly low."

This is incorrect. Spanner uses replica sets of 3–5 nodes, then relies on two-phase commit (2PC) across their primaries. 2PC requires unanimity ("all Aye") from participants and does not use quorums. So Spanner only employs flat quorums at the replica set level. Unfortunately, the paper doubles down on this misunderstanding, further claiming that Spanner uses hierarchical quorums in the related work section. Wait, wut?! "When scaling a system, sharding—partitioning a large group of nodes into hierarchical quorums—is often applied, as exemplified by Google Spanner [16]. Compared to majority-quorum consensus, hierarchical quorum consensus (HQC) allows consensus decisions to be made simultaneously within groups. Once a consensus decision is reached at the child-level groups, parent groups aggregate these decisions up to the root group (shown in Figure 2)."

The paper seemed desperate to argue that large replica sets are used in practice and that majority quorums fail to scale, since that would motivate the need for weighted consensus. They even cited a 2004 survey for this claim (really, 2004?) but could not identify any large replica set deployments in use in practice. That is not surprising, because large replica sets do not make much sense: F>2 is uncommon, and large replica sets are not economical. Some geo-replication papers mention large quorums, but the remote replicas usually act as learners, not as quorum participants. Again, those are just papers, not practically deployed systems. The largest quorum I know of is from Physalia, which uses N=7 and quorum size 4. Even there, the choice was made to reduce blast radius and improve partition tolerance, and the cost was justified because a Physalia deployment serves as configuration for many chain-replication groups at once.

So Cabinet's motivation of framing inefficiency as stemming solely from majority quorums is simplistic and unconvincing. Even in theoretically plausible large deployments, easier mitigations exist. For example, communication overlays, such as those we proposed in PigPaxos (2020), can address scaling bottlenecks directly.


Weighted Consensus Algorithm

Let's move past the shaky motivation and examine the algorithm itself. Cabinet introduces a weighted consensus approach for "fast agreement in heterogeneous systems." The paper allows a customizable failure threshold t, just as flexible Paxos does, and requires only t+1 responses to commit a decision, again, just as flexible Paxos does.

The paper recasts the quorum intersection property in terms of weights. There was a lot of weighted voting work in the 1980s, and the paper draws on it. Traditional weighted voting sets the consensus threshold (CT) at half of the total weight. But the paper points out that this alone cannot guarantee both safety and liveness. To fix this, the authors propose two invariants:

  • I1: The sum of the t+1 highest weights (the cabinet members) must be greater than the consensus threshold (CT).
  • I2: The sum of the t highest weights (cabinet members minus the lowest member) must be less than the consensus threshold.

In other words, a valid weight scheme must satisfy Equation 2.

Notice that this is just recasting the quorum intersection property in terms of weights. Neat. The weight allocation mechanism was the most interesting part of the paper. Cabinet observes that by assigning weights using geometric sequences, it can satisfy Equation 2 formulaically. Specifically, Cabinet applies geometric sequences for a given t (where 1 ≤ t ≤ (n−1)/2). With a common ratio 1 < r < 2, the sequence increases monotonically but never lets any single term exceed the sum of prior terms, preserving quorum intersection.
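To make this concrete, here is a small sketch of the constraints as I reconstruct them from the text (not the paper's exact Equation 2): a weight scheme admits a valid consensus threshold if CT lies above both the sum of the top t weights (I2) and half the total weight (so that any two quorums of weight > CT intersect), and below the sum of the top t+1 weights (I1). Geometric weights with ratio 1 < r < 2 leave that interval nonempty:

```python
# Sketch: check that a weight assignment admits a valid consensus threshold.
# This is my reconstruction of the constraints described in the text, not
# the paper's exact Equation 2.

def ct_interval(weights, t):
    """Open interval of valid CT values, or None if the scheme is invalid.

    I1: sum of top t+1 weights > CT (the fastest t+1 nodes can commit)
    I2: sum of top t weights   < CT (t nodes alone cannot commit)
    Intersection: CT > total/2, so any two quorums of weight > CT overlap.
    """
    w = sorted(weights, reverse=True)
    lo = max(sum(w[:t]), sum(weights) / 2.0)
    hi = sum(w[:t + 1])
    return (lo, hi) if lo < hi else None

def geometric_weights(n, r=1.5):
    # Common ratio 1 < r < 2: the sequence grows monotonically, but no
    # single term exceeds the sum of the prior terms.
    return [r ** i for i in range(n)]

w = geometric_weights(5)                 # [1, 1.5, 2.25, 3.375, 5.0625]
print(ct_interval(w, t=2))               # nonempty -> a valid CT exists
```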



Initially, nodes are assigned weights in descending order of their IDs: higher IDs get lower initial weights. These initial weights are then redistributed dynamically based on node responsiveness, prioritizing faster nodes as cabinet members. Importantly, no new weights are created; the leader merely redistributes existing ones among nodes. The paper claims that fast nodes are also reliable, but its only justification is, again, that 2004 survey, which is hardly convincing.

The idea behind dynamic reassignment is to adapt quickly to performance fluctuations and reach the quorum weight threshold with low latency, but this introduces significant overhead. The paper notes that the weights need to be communicated and maintained through log replication, which adds logging overhead. Worse, the reweighting is performed for every consensus instance. (What puzzles me is how this could possibly interact with pipelining in MultiPaxos and Raft, but that is another story.) That seems like overkill.

Algorithm 2 shows the mechanics. The leader maintains a weight clock (W_clock) and assigns weight values (W_i) per node to track consensus rounds and update weights. I guess the reasoning is that this must be logged and communicated to followers to ensure safety during leader changes. Otherwise, a newly elected leader might form non-intersecting quorums and violate safety. The result is weight metadata stored with every message. Uff. (Again, how does this reconcile with pipelined/Multi-Paxos style communication?)


Flexible Quorums with a vengeance

This is where things fall apart, in Section 4.1.3.

This subsection says that Cabinet adopts Raft's leader election mechanism but does not apply weighting there; for simplicity, weighted consensus is used only in replication. But the paper continues: to adjust for t+1 quorums, Raft's leader election quorum size should be set to n-t, so a candidate must receive votes from n-t nodes to win.

But this is exactly how flexible quorums work: if phase-2 (commit) quorums are size t+1, then phase-1 (leader election) quorums must be size n-t. Cabinet reduces to flexible quorums. Unfortunately, multiple things fall apart with this.
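The arithmetic behind this reduction is simple; a minimal sketch of the flexible quorum intersection rule (any election quorum overlaps any commit quorum exactly when their sizes sum to more than n):

```python
# Flexible quorums: a phase-1 (election) quorum of size q1 and a phase-2
# (commit) quorum of size q2 are guaranteed to intersect iff q1 + q2 > n.
def intersect_guaranteed(n, q1, q2):
    return q1 + q2 > n

n, t = 7, 2
q2 = t + 1        # Cabinet's commit quorum: t+1 responses
q1 = n - t        # the election quorum size Cabinet sets for Raft
print(intersect_guaranteed(n, q1, q2))   # (n-t) + (t+1) = n+1 > n -> True
```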

This undercuts the entire framing. The paper claims tolerance of at least t failures in the worst case and up to n-t-1 in the best case. But leader election forces the system into the flexible quorum model, where fault tolerance is defined to be at most t, and the range promised by Cabinet is gone. 

Here is another kicker: flexible Paxos already uses the fastest t+1 responses, without bothering with weights. It doesn't need to guess who the fastest nodes will be; it simply waits for the quickest t+1 replies. All nodes are equal, no weight-tracking needed, so no overhead incurred in tracking and communicating that through the logs. So what was the point of all this? Defaulting to flexible Paxos simplifies everything and is more efficient. No weight bookkeeping, no per-slot weight metadata. Cabinet's methods add complexity and overhead without providing benefits beyond flexible Paxos.

Finally, why does Cabinet focus only on making replication/commit quorums intersect? The answer seems to be that it missed the key lesson from the flexible quorum paper. Flexible quorums weakened Paxos's requirement that all quorums intersect to the more efficient rule that only phase-1 (Q1) and phase-2 (Q2) quorums must intersect. In other words, it is enough for any Q1 quorum (used for election) to intersect with any Q2 quorum (used for commits), and Q2 quorums themselves do not need to intersect with each other. Cabinet, however, spent considerable effort designing a weighted system to ensure intersection among all Q2 quorums, only to later fall back on flexible quorum logic in Section 4.1.3. As a result, the work on enforcing Q2–Q2 intersection ended up achieving nothing.

Let's do the crosscheck on this. Cabinet requires maintaining and logging weights for every consensus instance. Yet Raft's strong leader property already ensures that a newly elected leader cannot overwrite previously committed values in a disjoint Q2 quorum, regardless of quorum composition. Even if the newly elected leader produces a non-intersecting quorum, safety is preserved because Raft enforces log continuity by selecting the node with the longest log in the Q1 quorum (and the extra condition for Flexible-Raft is that Q1 quorums should also intersect). Under this model, storing and propagating weight assignments per slot becomes unnecessary overhead. (A Paxos leader learning all previous values before switching to MultiPaxos mode achieves the same goal.) The insistence on tracking weights appears rooted in a misunderstanding of how leader election and log safety interact. This undermines the central justification for Cabinet's complexity, revealing that the elaborate weight bookkeeping brings no tangible benefit.


Can this be salvaged?

During our blind read, we initially thought the paper might explore probabilistic failure detection and weighted fault probabilities. That would have been a more compelling direction. Recall the HotOS'25 paper we reviewed, "Real Life Is Uncertain. Consensus Should Be Too!" (https://muratbuffalo.blogspot.com/2025/07/real-life-is-uncertain-consensus-should.html). Pivoting in that direction could rescue some of these ideas.

Another promising salvage path could be non-leader linearizable reads. If weights were cached and relatively stable, they could be exploited to form smaller read quorums (in the style of our PQR work), avoiding the need for majorities or full Q2 complements. This could deliver practical efficiency gains, especially in read-heavy workloads.

Finally, it might be worth exploring weight-encoding mechanisms that explicitly target Q1–Q2 quorum intersection. This would likely involve assigning two weight sets per node: one for Q1 participation and another for Q2. Such a design could provide a cleaner, purpose-built use of weights rather than forcing them into roles already covered by flexible quorums.

CockroachDB and CedarDB: Better Together

Even simple and innocent-looking queries can prove quite challenging for industry-leading transactional database systems. Take the innocent task of finding the 10 top-grossing items, along with how much we sold, how much money they made, what we usually charge per unit, how often they ship from a remote warehouse, and how much stock is still left in the warehouse.

August 20, 2025

Deep Diving the Citus Distribution Models Along with Shard Balancing/Read Scaling

In my previous blog post, Integrating Citus with Patroni: Sharding and High Availability Together, I explored how to integrate Citus with Patroni and demonstrated how basic table distribution works. In this follow-up post, I will discuss various other Citus distribution models. We will also explore how shard rebalancing and data movement are handled and further […]

August 19, 2025

Vibe code with AWS databases using Vercel v0

In this post, we explore how you can use Vercel’s v0 generative UI to build applications with a modern UI for AWS purpose-built databases such as Amazon Aurora, Amazon DynamoDB, Amazon Neptune, and Amazon ElastiCache.

Secure, Centralized Authentication Comes to Percona Server for MongoDB with OpenID Connect

Today, Percona is proud to announce the release of OpenID Connect (OIDC) support for Percona Server for MongoDB, the source-available, enterprise-grade MongoDB-compatible database solution trusted by developers and IT leaders globally. With this new capability, Percona customers can integrate with leading identity providers (IdPs) like Okta, Microsoft Entra, Ping Identity, Keycloak, and others to simplify […]

August 18, 2025

Demystifying the AWS advanced JDBC wrapper plugins

In 2023, AWS introduced the AWS advanced JDBC wrapper, enhancing the capabilities of existing JDBC drivers with additional functionality. This wrapper enables support of AWS and Amazon Aurora functions on top of an existing PostgreSQL, MySQL, or MariaDB JDBC driver of your choice. This wrapper supports a variety of plugins, including the Aurora connection tracker plugin, the limitless connection plugin, and the read-write splitting plugin. In this post, we discuss the benefits, use cases, and implementation details for two popular AWS Advanced JDBC Wrapper Driver plugins: the Aurora Initial Connection Strategy and Failover v2 plugins.

MongoDB arrays: sort order and comparison

Arrays in MongoDB are ordered lists of values, and querying them can be confusing because a field might be a scalar value in one document and an array in another. Unlike SQL databases, there's no need to choose between one-to-one, one-to-many, or many-to-many relationships once for all future data. Within the limits of schema validation, each document can have a unique schema, allowing fields to be any data type, scalar or array. When indexing or querying a collection, field names are used, and operations like filtering or sorting must work for all documents, regardless of their data types. This flexibility can lead to behavior that may seem strange to developers accustomed to the strict schemas of SQL databases.

  • Sort and Comparison in SQL databases
  • MongoDB and Arrays (comparison and sort)
  • Another example of MongoDB arrays (sort on min/max)
  • Comparison with PostgreSQL
  • Sort semantics is close to how indexes work
  • Conclusion and how to remember it

Sort and Comparison in SQL databases

I've heard users familiar with PostgreSQL be surprised to learn that MongoDB employs different semantics for sorting and comparing data. For instance, when sorting fruits alphabetically in ascending order, 'Apple' appears before 'Banana', and you also expect 'Banana' > 'Apple' to be true. I was not so surprised, because I know other databases and other languages. For example, I create a table of fruits in Oracle Autonomous Database in Zurich:

SQL> CREATE TABLE fruits_de (
  2      icon NVARCHAR2(4),
  3      name NVARCHAR2(50)
  4* );

Table FRUITS_DE erstellt.

SQL> INSERT INTO fruits_de (icon, name) VALUES
  2    (N'🍌', N'Banane'),
  3    (N'🍏', N'Äpfel'),
  4    (N'🍇', N'Traube'),
  5*   (N'🍊', N'Orange');

4 Zeilen eingefügt.

SQL> COMMIT;
Commit abgeschlossen.

A sort shows the apple (Äpfel) first:

SQL> SELECT icon, name 
     FROM fruits_de 
     ORDER BY name
;

ICON    NAME
_______ _________
🍏      Äpfel
🍌      Banane
🍊      Orange
🍇      Traube

If comparison and sort worked the same, I would expect a filter on name > 'Banane' to show only Orange and Traube, but Äpfel is still there:

SQL> SELECT icon, name FROM fruits_de WHERE name >'Banane' order by name ;

ICON    NAME
_______ _________
🍏      Äpfel
🍊      Orange
🍇      Traube

By default, NLS_SORT determines ordering based on language (in German, 'Ä' is treated as 'Ae'), while NLS_COMP, defaulting to binary, dictates comparison (in UTF8, 'Ä' follows all ASCII characters).
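A minimal Python sketch of the two behaviors (the sort key below is a crude stand-in for the German linguistic sort; real NLS collation is much richer):

```python
# Simplified model: a German linguistic sort treats 'Ä' as 'Ae', while a
# binary comparison orders by code point ('Ä' is U+00C4, after all ASCII).
names = ["Banane", "Äpfel", "Traube", "Orange"]

def german_sort_key(s):
    # crude approximation of NLS_SORT=GERMAN, for illustration only
    return (s.replace("Ä", "Ae").replace("ä", "ae")
             .replace("Ö", "Oe").replace("ö", "oe")
             .replace("Ü", "Ue").replace("ü", "ue"))

print(sorted(names, key=german_sort_key))
# linguistic sort puts Äpfel first, like the ORDER BY above

print([n for n in names if n > "Banane"])
# binary comparison keeps Äpfel, like the WHERE filter above
```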

In Oracle Database, this difference happens with text because it depends on the language, and it can be configured (the binary default is there for historical reasons and better performance). With text, MongoDB is consistent by default between comparison and sorting:

db.fruits_de.insertMany([
      { icon: "🍌", name: "Banane" },  { icon: "🍏", name: "Äpfel" },  { icon: "🍇", name: "Traube" },  { icon: "🍊", name: "Orange" }
    ]);

db.fruits_de.find(
  { }  ,
  { _id:0 }
).collation({ locale: "de", strength: 1 }).sort({ name: 1 })

[
  { icon: '🍏', name: 'Äpfel' },
  { icon: '🍌', name: 'Banane' },
  { icon: '🍊', name: 'Orange' },
  { icon: '🍇', name: 'Traube' }
]

db.fruits_de.find(
  { name: { $gt: "Banane" } }  ,
  { _id:0 }
).collation({ locale: "de", strength: 1 }).sort({ name: 1 })

[ { icon: '🍊', name: 'Orange' }, { icon: '🍇', name: 'Traube' } ]

When the field being sorted or compared is an array, there is no single obvious rule for ordering and comparison. Is [Cherry, Apple] greater or less than [Banana, Orange]? Which comes first in ascending order: the one that starts with Banana (which is before Cherry), or the one with Apple (which is before both Banana and Orange)?
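MongoDB's documented answer, sketched in Python: for an ascending sort, a document's sort key is the array's minimum element; for a descending sort, its maximum. (Element comparison follows the BSON type order; here all values are strings, so plain string comparison stands in for it.)

```python
docs = [
    {"name": ["Cherry", "Apple"]},
    {"name": ["Banana", "Orange"]},
]

# Ascending: compare documents by the smallest element of each array.
asc = sorted(docs, key=lambda d: min(d["name"]))
print([d["name"] for d in asc])
# ["Cherry","Apple"] sorts first because "Apple" < "Banana"

# Descending: compare documents by the largest element of each array.
desc = sorted(docs, key=lambda d: max(d["name"]), reverse=True)
print([d["name"] for d in desc])
# ["Banana","Orange"] sorts first because "Orange" > "Cherry"
```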

MongoDB and Arrays (comparison and sort)

I have a collection of fruits, identified by a text name ("txt" field) and an indexed array of characters ("arr" field):

db.fruits.createIndex({txt:1});

db.fruits.createIndex({arr:1});

db.fruits.insertMany([
  { _id: "🍏", txt: "apple",              arr: ["a","p","p","l","e"] },
  { _id: "🍐", txt: "pear",               arr: ["p","e","a","r"] },
  { _id: "🍊", txt: "tangerine",          arr: ["t","a","n","g","e","r","i","n","e"] },
  { _id: "🍋", txt: "lemon",              arr: ["l","e","m","o","n"] },
  { _id: "🍌", txt: "banana",             arr: ["b","a","n","a","n","a"] },
  { _id: "🍉", txt: "watermelon",         arr: ["w","a","t","e","r","m","e","l","o","n"] },
  { _id: "🍇", txt: "grapes",             arr: ["g","r","a","p","e","s"] },
  { _id: "🍓", txt: "strawberry",         arr: ["s","t","r","a","w","b","e","r","r","y"] },
  { _id: "🫐", txt: "blueberries",        arr: ["b","l","u","e","b","e","r","r","i","e","s"] },
  { _id: "🍈", txt: "melon",              arr: ["m","e","l","o","n"] },
  { _id: "🍒", txt: "cherries",           arr: ["c","h","e","r","r","i","e","s"] },
  { _id: "🍑", txt: "peach",              arr: ["p","e","a","c","h"] },
  { _id: "🥭", txt: "mango",              arr: ["m","a","n","g","o"] },
  { _id: "🍍", txt: "pineapple",          arr: ["p","i","n","e","a","p","p","l","e"] },
  { _id: "🥥", txt: "coconut",            arr: ["c","o","c","o","n","u","t"] },
  { _id: "🥝", txt: "kiwi fruit",         arr: ["k","i","w","i"," ","f","r","u","i","t"] }
]);

Here are the documents ordered by their identifier:

mdb> db.fruits.find( ).
              sort( { "_id": 1 } ).
              forEach(doc => {  print(JSON.stringify(doc));  })

{"_id":"🍇","txt":"grapes","arr":["g","r","a","p","e","s"]}
{"_id":"🍈","txt":"melon","arr":["m","e","l","o","n"]}
{"_id":"🍉","txt":"watermelon","arr":["w","a","t","e","r","m","e","l","o","n"]}
{"_id":"🍊","txt":"tangerine","arr":["t","a","n","g","e","r","i","n","e"]}
{"_id":"🍋","txt":"lemon","arr":["l","e","m","o","n"]}
{"_id":"🍌","txt":"banana","arr":["b","a","n","a","n","a"]}
{"_id":"🍍","txt":"pineapple","arr":["p","i","n","e","a","p","p","l","e"]}
{"_id":"🍏","txt":"apple","arr":["a","p","p","l","e"]}
{"_id":"🍐","txt":"pear","arr":["p","e","a","r"]}
{"_id":"🍑","txt":"peach","arr":["p","e","a","c","h"]}
{"_id":"🍒","txt":"cherries","arr":["c","h","e","r","r","i","e","s"]}
{"_id":"🍓","txt":"strawberry","arr":["s","t","r","a","w","b","e","r","r","y"]}
{"_id":"🥝","txt":"kiwi fruit","arr":["k","i","w","i"," ","f","r","u","i","t"]}
{"_id":"🥥","txt":"coconut","arr":["c","o","c","o","n","u","t"]}
{"_id":"🥭","txt":"mango","arr":["m","a","n","g","o"]}
{"_id":"🫐","txt":"blueberries","arr":["b","l","u","e","b","e","r","r","i","e","s"]}

The identifiers are in binary order of their Unicode code points:
🍇 U+1F347 < 🍈 U+1F348 < 🍉 U+1F349 < 🍊 U+1F34A < 🍋 U+1F34B < 🍌 U+1F34C < 🍍 U+1F34D < 🍏 U+1F34F < 🍐 U+1F350 < 🍑 U+1F351 < 🍒 U+1F352 < 🍓 U+1F353 < 🥝 U+1F95D < 🥥 U+1F965 < 🥭 U+1F96D < 🫐 U+1FAD0

Here are the documents ordered by their name in text:

mdb> db.fruits.find( ).
              sort( { "txt": 1 } ).
              forEach(doc => {  print(JSON.stringify(doc));  })

{"_id":"🍏","txt":"apple","arr":["a","p","p","l","e"]}
{"_id":"🍌","txt":"banana","arr":["b","a","n","a","n","a"]}
{"_id":"🫐","txt":"blueberries","arr":["b","l","u","e","b","e","r","r","i","e","s"]}
{"_id":"🍒","txt":"cherries","arr":["c","h","e","r","r","i","e","s"]}
{"_id":"🥥","txt":"coconut","arr":["c","o","c","o","n","u","t"]}
{"_id":"🍇","txt":"grapes","arr":["g","r","a","p","e","s"]}
{"_id":"🥝","txt":"kiwi fruit","arr":["k","i","w","i"," ","f","r","u","i","t"]}
{"_id":"🍋","txt":"lemon","arr":["l","e","m","o","n"]}
{"_id":"🥭","txt":"mango","arr":["m","a","n","g","o"]}
{"_id":"🍈","txt":"melon","arr":["m","e","l","o","n"]}
{"_id":"🍑","txt":"peach","arr":["p","e","a","c","h"]}
{"_id":"🍐","txt":"pear","arr":["p","e","a","r"]}
{"_id":"🍍","txt":"pineapple","arr":["p","i","n","e","a","p","p","l","e"]}
{"_id":"🍓","txt":"strawberry","arr":["s","t","r","a","w","b","e","r","r","y"]}
{"_id":"🍊","txt":"tangerine","arr":["t","a","n","g","e","r","i","n","e"]}
{"_id":"🍉","txt":"watermelon","arr":["w","a","t","e","r","m","e","l","o","n"]}


August 17, 2025

The Vibe Coder's Guide to Supabase Environments

Build a professional deployment workflow for your Supabase project. Learn essential patterns that prevent 3am panic attacks while keeping your workflow fun, simple, and safe.

August 16, 2025

The Vibe Coding Master Checklist

Get your AI-generated app ready for production with this comprehensive guide covering security, performance, and deployment best practices.

Vibe Coding: Best Practices for Prompting

Master the art of communicating with AI coding assistants through effective prompting strategies, iterative refinement, and systematic approaches that turn ideas into deployable applications.

Efficient COUNT, SUM, MAX with the Aggregate Component

Convex omits built-in aggregates because full-table scans don’t scale; the video shows how @convex-dev/aggregate (B-Tree powered) enables fast pagination, ranking, per-user stats, and randomization with fully reactive queries. It also covers keeping aggregates in sync via triggers/custom functions, backfilling with migrations, and the trade-offs that hint at possible platform-level support.
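The core idea behind tree-backed aggregates (as opposed to full-table scans) can be shown with a generic Fenwick (binary indexed) tree, which answers prefix sums in O(log n); this is an illustrative sketch of the general technique, not the @convex-dev/aggregate API:

```python
# Fenwick (binary indexed) tree: O(log n) point updates and prefix
# sums, instead of an O(n) full scan per COUNT/SUM query.
class Fenwick:
    def __init__(self, n):
        self.t = [0] * (n + 1)

    def add(self, i, delta):      # point update, 1-based index
        while i < len(self.t):
            self.t[i] += delta
            i += i & -i

    def prefix_sum(self, i):      # sum of items 1..i
        s = 0
        while i > 0:
            s += self.t[i]
            i -= i & -i
        return s

f = Fenwick(8)
for i, v in enumerate([3, 1, 4, 1, 5, 9, 2, 6], start=1):
    f.add(i, v)
print(f.prefix_sum(4))  # 3 + 1 + 4 + 1 = 9
```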

August 15, 2025

Securing Amazon Aurora DSQL: Access control best practices

You can access an Amazon Aurora DSQL cluster by using a public endpoint and AWS PrivateLink endpoints. In this post, we demonstrate how to control access to your Aurora DSQL cluster by using public endpoints and private VPC endpoints through PrivateLink, both from inside and outside AWS.


Why doesn't Oracle Multi-Value Index optimize .sort() like MongoDB does with its multi-key index? RecordId deduplication.

Until recently, Oracle Database offered very limited options for creating inverted indexes, primarily restricted to special indexing for text and spatial data. Unlike PostgreSQL with its Generalized Inverted Index (GIN), Oracle lacked a general solution. However, in response to the growing demand for document databases and to provide a MongoDB emulation, Oracle introduced the Multi-Value Index (MVI) in version 23ai. Let's create one, examine its internal structure, and see how it differs from MongoDB indexes.

Lab setup

I start a Docker container with Oracle 23ai running and connect with SQL*Plus:

docker run --name ora -d -e ORACLE_PWD=OracleDatabase_23ai container-registry.oracle.com/database/free:latest

docker exec -it ora bash -c 'until grep -B100 "Completed: ALTER DATABASE OPEN" /opt/oracle/diag/rdbms/free/FREE/trace/alert_FREE.log ; do sleep 1 ; done'

docker exec -it ora sqlplus / as sysdba

I create a table to store some JSON documents:

DROP TABLE if exists franck;
CREATE TABLE franck (data JSON);

INSERT INTO franck VALUES
  ('{ field1: "x", field2: [ "jkl" , "mno" , "jkl" ] }') -- 78 , [ 6a 6b 6c , 6d 6e 6f , 6a 6b 6c ]
 ,('{ field1: "y", field2: [ "mno" , "jkl"         ] }') -- 79 , [ 6d 6e 6f , 6a 6b 6c ]
 ,('{ field1: "z", field2:   "jkl"                   }') -- 7a ,   6a 6b 6c
;

The structure is flexible: for the second field I inserted arrays in the first two documents and a scalar in the third, with some redundant values and different element orders.

Multi-Value Index

I create a multi-value index similar to what the MongoDB API emulation would create. It uses JSON_TABLE to extract fields to columns to be indexed. It uses JSON_MKMVI to make an index entry for the multi-value index:

CREATE MULTIVALUE INDEX FRANCK_MVI ON FRANCK (
 JSON_MKMVI(
  JSON_TABLE(
    "DATA", '$' COLUMNS (
     "SCALAR" PATH '$."field1"' ERROR ON ERROR NULL ON EMPTY NULL ON MISMATCH,
     NESTED PATH '$."field2"[*]' COLUMNS (
       "K0" ANY ORA_RAWCOMPARE PATH '$' ERROR ON ERROR NULL ON EMPTY NULL ON MISMATCH
     )
    )
  )
 )
);

ORA_RAWCOMPARE is a clue that the ordering of nested values is binary. It would be nice to run those functions and see how they make index entries from JSON, but we cannot use them directly:

  select JSON_TABLE(
         *
ERROR at line 2:
ORA-61712: Can't use json_table syntax to create non-MVI(MultiValue Index): .
Help: https://docs.oracle.com/error-help/db/ora-61712/

The index entries will have a compound key with values extracted from the DATA column:

  • field1 as a scalar value (not used by the MongoDB emulation as all fields can be an array)
  • field2[*] as the values within an array, and NESTED PATH unnests the values (like $unwind)

The indexes created by the MongoDB API emulation are all NESTED PATH to be compatible with MongoDB's flexible schema, where any field in a document can be a scalar or an array, but I created one scalar to see the behavior.
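The unnest-and-deduplicate behavior can be simulated outside the database. This is an illustrative Python sketch (not Oracle code, and the field names are just the ones used above): each document contributes one compound key per distinct array element, which matches the five entries we will see in the leaf block dump:

```python
# Simulate multi-value index entry generation: unnest the array
# (like NESTED PATH / $unwind) and de-duplicate per document.
docs = [
    {"rowid": 0, "field1": "x", "field2": ["jkl", "mno", "jkl"]},
    {"rowid": 1, "field1": "y", "field2": ["mno", "jkl"]},
    {"rowid": 2, "field1": "z", "field2": "jkl"},  # scalar, not an array
]

entries = []
for doc in docs:
    values = doc["field2"]
    if not isinstance(values, list):   # a scalar indexes as a single value
        values = [values]
    for v in set(values):              # one entry per DISTINCT value
        entries.append((doc["field1"], v, doc["rowid"]))

entries.sort()  # B-tree key order: (field1, field2 element, rowid)
for e in entries:
    print(e)    # 5 entries: the duplicate "jkl" in doc 1 collapses to one
```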

Index internals

The index is created, making it easy to dump what's inside. I get the block identifier from DBA_SEGMENTS and dump the block:

SQL> -- get the block offset in datafile 1 ( I run this as sysdba, so it goes to the SYSTEM tablespace )
SQL> column block new_value block
SQL> select header_block+1 as block from dba_segments where segment_name='FRANCK_MVI' and owner=user;

     BLOCK
----------
    114553

SQL> -- set tracefile identifier and dump the block
SQL> alter session set tracefile_identifier = 'franck&block';
old   1: alter session set tracefile_identifier = 'franck&block'
new   1: alter session set tracefile_identifier = 'franck    114553'

Session altered.

SQL> -- get tracefile name and print
SQL> column tracefile new_value tracefile
SQL> select tracefile from v$process where addr=(select paddr from v$session where sid=sys_context('userenv','sid'));

TRACEFILE
--------------------------------------------------------------------------------
/opt/oracle/diag/rdbms/free/FREE/trace/FREE_ora_3518_franck    114553.trc

SQL> host cat "&tracefile."
Trace file /opt/oracle/diag/rdbms/free/FREE/trace/FREE_ora_3518_franck    114553.trc
Oracle Database 23ai Free Release 23.0.0.0.0 - Develop, Learn, and Run for Free
Version 23.8.0.25.04
Build label:    RDBMS_23.8.0.25.04DBRU_LINUX.X64_250423
ORACLE_HOME:    /opt/oracle/product/23ai/dbhomeFree
System name:    Linux
Node name:      6fc22939038c
Release:        4.14.35-2047.540.4.1.el7uek.x86_64
Version:        #2 SMP Thu Sep 5 12:06:13 PDT 2024
Machine:        x86_64
VM name:        KVM CPUID feature flags: 0x01003afb
CLID:   P
Instance name: FREE
Instance number: 1
Database name: FREE
Database unique name: FREE
Database id: 1471336774
Database role: PRIMARY
Redo thread mounted by this instance: 1
Oracle binary inode:    1036542
Oracle process number: 63
Unix process pid:       3518
             NID:       4026532223
             created:   2025-08-15T09:12:36.-2739213311+00:00
             image:     oracle@6fc22939038c (TNS V1-V3)


*** 2025-08-15T09:13:47.601001+00:00 (CDB$ROOT(1))
*** SESSION ID:(213.17644) 2025-08-15T09:13:47.601014+00:00
*** CLIENT ID:() 2025-08-15T09:13:47.601024+00:00
*** SERVICE NAME:(SYS$USERS) 2025-08-15T09:13:47.601033+00:00
*** MODULE NAME:(sqlplus@6fc22939038c (TNS V1-V3)) 2025-08-15T09:13:47.601043+00:00
*** ACTION NAME:() 2025-08-15T09:13:47.601052+00:00
*** CLIENT DRIVER:(SQL*PLUS) 2025-08-15T09:13:47.601060+00:00
*** CONTAINER ID:(1) 2025-08-15T09:13:47.601069+00:00
*** CLIENT IP:(N/A) 2025-08-15T09:13:47.601078+00:00
*** CONNECTION ID:(PGTKhibdDbfgYwIAEaztFA==) 2025-08-15T09:13:47.601088+00:00

Start dump data blocks tsn: 0 file#:1 minblk 114553 maxblk 114553
Block dump from cache:

Dump of buffer cache at level 3 for pdb=1 tsn=0 rdba=114553

BH (0xbdf5b740) file#: 1 rdba: 0x0001bf79 (1024/114553) class: 1 ba: 0xbd134000
  set: 6 pool: 3 bsz: 8192 bsi: 0 sflg: 2 pwc: 0,0
  dbwrid: 0 obj: 72614 objn: 72614 tsn: [1/0] afn: 1 hint: 0xf
  hash: [0xc3f52b50,0x77de4520] lru: [0xbdf5d2e0,0xbdf5b990]
  ckptq: [NULL] fileq: [NULL]
  objq: [0x76c0c790,0x76c0c790] objaq: [0x76c0c780,0x76c0c780] qhead: 0x76c0c770
  st: XCURRENT md: NULL fpin: 'ktswh23: ktsfbkl' fscn: 0x2f14be fcur_vldr: 0x3 tch: 1
  flags: block_written_once
  LRBA: [0x0.0.0] LSCN: [0x0] HSCN: [0x2f14c0] HSUB: [1]
  Printing buffer operation history (latest change first): (cnt 9)
  01. sid:11 L192:kcbbic2:bic:FBD     02. sid:11 L191:kcbbic2:bic:FBW
  03. sid:11 L602:bic1_int:bis:FWC    04. sid:11 L822:bic1_int:ent:rtn
  05. sid:11 L832:oswmqbg1:clr:WRT    06. sid:11 L930:kubc:sw:mq
  07. sid:11 L913:bxsv:sw:objq        08. sid:11 L608:bxsv:bis:FBW
  09. sid:11 L607:bxsv:bis:FFW        10. sid:05 L464:chg1_mn:bic:FMS
  11. sid:05 L778:chg1_mn:bis:FMS     12. sid:05 L552:chg_main:bic:CLN
  13. sid:05 L353:gcur:set:MEXCL      14. sid:05 L353:gcur:set:MEXCL

  buffer tsn: 0 rdba: 0x0001bf79 (1024/114553)
  scn: 0x2f14c0 seq: 0x01 flg: 0x04 tail: 0x14c00601
  frmt: 0x02 chkval: 0x2a1a type: 0x06=trans data
Hex dump of block: st=0, typ_found=1
Dumping 'block' addr=0xbd134000 size=8192 bytes
Dump of memory from 0xbd134000 to 0xbd136000
0BD134000 0000A206 0001BF79 002F14C0 04010000  [....y...../.....]
0BD134010 00002A1A 00400102 00011BA6 002F14BE  [.*....@......./.]
0BD134020 00008000 00020002 00000000 00000000  [................]
0BD134030 00000000 00000000 00000000 00000000  [................]
0BD134040 00000000 0000FFFF 00000000 00000000  [................]
0BD134050 00000000 80008000 002F14BE 03800000  [........../.....]
0BD134060 00000000 002E0005 1ED31F01 00000000  [................]
0BD134070 00000000 00000000 00000002 00001F60  [............`...]
0BD134080 1F3A1F4D 1F141F27 00001F01 00000000  [M.:.'...........]
0BD134090 00000000 00000000 00000000 00000000  [................]
        Repeat 492 times
0BD135F60 7A040201 6B6A0404 0000066C 02004108  [...z..jkl....A..]
0BD135F70 02000000 04790402 6F6E6D04 08000006  [......y..mno....]
0BD135F80 00010041 02020000 04047904 066C6B6A  [A........y..jkl.]
0BD135F90 41080000 00000100 04020200 6D040478  [...A........x..m]
0BD135FA0 00066F6E 00410800 00000000 78040202  [no....A........x]
0BD135FB0 6B6A0404 0000066C 00004108 00000000  [..jkl....A......]
0BD135FC0 00000000 00000000 00000000 00000000  [................]
        Repeat 2 times
0BD135FF0 00000000 00000000 00000000 14C00601  [................]

Block scn: 0x2f14c0
Block header dump:  0x0001bf79
 Object id on Block? Y
 seg/obj: 0x11ba6  csc:  0x00000000002f14be  itc: 2  flg: --  typ: 2 - INDEX
     fsl: 0  fnx: 0x0 ver: 0x01

 Itl           Xid                  Uba         Flag  Lck        Scn/Fsc
0x01   0x0000.000.00000000  0x00000000.0000.00  ----    0  fsc 0x0000.00000000
0x02   0xffff.000.00000000  0x00000000.0000.00  C---    0  scn  0x00000000002f14be
Leaf block dump
===============
header address 3172155484=0xbd13405c
kdxcolev 0
KDXCOLEV Flags = - - -
kdxcolok 0
kdxcoopc 0x80: opcode=0: iot flags=--- is converted=Y
kdxconco 3
kdxcosdc 0
kdxconro 5
kdxcofbo 46=0x2e
kdxcofeo 7937=0x1f01
kdxcoavs 7891
kdxlespl 0
kdxlende 0
kdxlenxt 0=0x0
kdxleprv 0=0x0
kdxledsz 2
kdxlebksz 8032
row#0[8013] flag: -------, lock: 0, len=19, data:(2):  00 02
col 0; len 2; (2):  04 78
col 1; len 4; (4):  04 6a 6b 6c
col 2; len 6; (6):  00 00 08 41 00 00
row#1[7994] flag: -------, lock: 0, len=19, data:(2):  00 02
col 0; len 2; (2):  04 78
col 1; len 4; (4):  04 6d 6e 6f
col 2; len 6; (6):  00 00 08 41 00 00
row#2[7975] flag: -------, lock: 0, len=19, data:(2):  00 02
col 0; len 2; (2):  04 79
col 1; len 4; (4):  04 6a 6b 6c
col 2; len 6; (6):  00 00 08 41 00 01
row#3[7956] flag: -------, lock: 0, len=19, data:(2):  00 02
col 0; len 2; (2):  04 79
col 1; len 4; (4):  04 6d 6e 6f
col 2; len 6; (6):  00 00 08 41 00 01
row#4[7937] flag: -------, lock: 0, len=19, data:(2):  00 01
col 0; len 2; (2):  04 7a
col 1; len 4; (4):  04 6a 6b 6c
col 2; len 6; (6):  00 00 08 41 00 02
----- end of leaf block Logical dump -----
----- end of leaf block dump -----

There is one index entry per unique value in the array, so five entries in total:

entry  key 0 (hex)  field1  key 1 (hex)  field2  rowid              row
0      04 78        'x'     04 6a 6b 6c  'jkl'   00 00 08 41 00 00  1
1      04 78        'x'     04 6d 6e 6f  'mno'   00 00 08 41 00 00  1
2      04 79        'y'     04 6a 6b 6c  'jkl'   00 00 08 41 00 01  2
3      04 79        'y'     04 6d 6e 6f  'mno'   00 00 08 41 00 01  2
4      04 7a        'z'     04 6a 6b 6c  'jkl'   00 00 08 41 00 02  3

The index appears to be suitable for ordered entries in MongoDB, which sorts the documents in ascending order by the minimum value of an array. However, Oracle multi-value indexes do not preserve the sort as they add a deduplication step. We will see that in execution plans.
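The "sort by minimum value of an array" semantics (what the emulated plan later calls sortByMinMax) can be sketched in Python with illustrative data: for an ascending sort, a document's comparison key is the smallest element of its array.

```python
# MongoDB ascending sort on an array field compares documents by the
# MINIMUM array element (the maximum would be used for descending).
docs = [
    {"_id": 1, "field2": ["mno", "xyz"]},
    {"_id": 2, "field2": ["abc", "zzz"]},
    {"_id": 3, "field2": "jkl"},  # a scalar sorts by its own value
]

def asc_sort_key(doc):
    v = doc["field2"]
    return min(v) if isinstance(v, list) else v

ordered = sorted(docs, key=asc_sort_key)
print([d["_id"] for d in ordered])  # keys "abc" < "jkl" < "mno" -> [2, 3, 1]
```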

Execution plan in MongoDB

Here is an example, with result and execution plan, in MongoDB:

db.franck.drop();

db.franck.createIndex({ field1: 1, field2: 2});

db.franck.insertMany([
     { field1: "x", field2: [ "jkl" , "mno" , "jkl" ] },
     { field1: "y", field2: [ "mno" , "jkl"         ] },
     { field1: "z", field2:   "jkl"                   },
    ]);

db.franck.find({ field1:"x" }).sort({ field1: 1, field2: 1 });

[
  {
    _id: ObjectId('689f03ed160c7ec59cd4b0ce'),
    field1: 'x',
    field2: [ 'jkl', 'mno', 'jkl' ]
  }
]

db.franck.find(
 { field1:"x" }
).sort(
 { field1: 1, field2: 1 }
).explain("executionStats").executionStats;

{
  executionSuccess: true,
  nReturned: 1,
  executionTimeMillis: 0,
  totalKeysExamined: 2,
  totalDocsExamined: 1,
  executionStages: {
    isCached: false,
    stage: 'FETCH',
    nReturned: 1,
    executionTimeMillisEstimate: 0,
    works: 3,
    advanced: 1,
    needTime: 1,
    needYield: 0,
    saveState: 0,
    restoreState: 0,
    isEOF: 1,
    docsExamined: 1,
    alreadyHasObj: 0,
    inputStage: {
      stage: 'IXSCAN',
      nReturned: 1,
      executionTimeMillisEstimate: 0,
      works: 3,
      advanced: 1,
      needTime: 1,
      needYield: 0,
      saveState: 0,
      restoreState: 0,
      isEOF: 1,
      keyPattern: { field1: 1, field2: 2 },
      indexName: 'field1_1_field2_2',
      isMultiKey: true,
      multiKeyPaths: { field1: [], field2: [ 'field2' ] },
      isUnique: false,
      isSparse: false,
      isPartial: false,
      indexVersion: 2,
      direction: 'forward',
      indexBounds: { field1: [ '["x", "x"]' ], field2: [ '[MinKey, MaxKey]' ] },
      keysExamined: 2,
      seeks: 1,
      dupsTested: 2,
      dupsDropped: 1
    }
  }
}

The execution plan shows an IXSCAN and a FETCH, but no SORT operation because the index entries are scanned in the expected order.

Important statistics to note are: dupsTested: 2 and dupsDropped: 1. A multi-key index contains multiple keys for one document, but only one key is needed to fetch the document, resulting in the dropping of duplicates. MongoDB IXSCAN can eliminate duplicate keys from the same document while maintaining key order, keeping the key that matters (the minimum for ascending key order) and eliminating the need for an additional sort.
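That order-preserving duplicate elimination can be sketched as: walk the entries in key order and keep only the first entry seen per RecordId. Because the first key encountered for a document is its minimum, the output is already in sort order. A minimal Python sketch with illustrative entries:

```python
# Ordered multikey index scan with duplicate-RecordId elimination:
# entries are (key, record_id) pairs already in B-tree key order.
index_entries = [
    ("jkl", "rec1"), ("jkl", "rec3"), ("mno", "rec1"), ("mno", "rec2"),
]

def ixscan_dedup(entries):
    seen = set()
    out, dropped = [], 0
    for key, rid in entries:
        if rid in seen:
            dropped += 1   # what executionStats reports as dupsDropped
        else:
            seen.add(rid)
            out.append(rid)
    return out, dropped

result, dropped = ixscan_dedup(index_entries)
print(result, dropped)  # ['rec1', 'rec3', 'rec2'] 1 -- min-key order, no SORT
```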

Execution plan in MongoDB emulation

I run the same on the MongoDB API emulation of Oracle Autonomous Database 23ai:

ora> db.franck.drop();

true

ora> db.franck.createIndex({ field1: 1, field2: 2});

field1_1_field2_2

ora> db.franck.insertMany([
         { field1: "x", field2: [ "jkl" , "mno" , "jkl" ] },
         { field1: "y", field2: [ "mno" , "jkl"         ] },
         { field1: "z", field2:   "jkl"                   },
        ]);

{
  acknowledged: true,
  insertedIds: {
    '0': ObjectId('689f041f23cd9d33b9d4b0c5'),
    '1': ObjectId('689f041f23cd9d33b9d4b0c6'),
    '2': ObjectId('689f041f23cd9d33b9d4b0c7')
  }
}

ora> db.franck.find(
 { field1:"x" }
).sort(
 { field1: 1, field2: 1 }
).explain("executionStats")

{
  queryPlanner: {
    plannerVersion: 1,
    namespace: 'ora.franck',
    indexFilterSet: false,
    parsedQuery: {
      '$query': { field1: { '$stringOnly': 'x' } },
      '$orderby': {
        '$fields': [
          { path: 'field1', order: 'asc', sortByMinMax: true },
          { path: 'field2', order: 'asc', sortByMinMax: true }
        ],
        '$lax': true
      }
    },
    rewrittenQuery: {
      '