a curated list of database news from authoritative sources

December 23, 2025

Kubernetes Multi-Cloud Architecture: Building Portable Databases Without Lock-In

Most organizations now run across multiple clouds, pursuing flexibility, better pricing, or regional availability. But while stateless applications move freely, databases often remain stuck. Each cloud provider offers its own managed database service (e.g., RDS, Cloud SQL, Azure Database) with distinct APIs, automation tools, and monitoring layers. Once you commit to one, moving becomes complicated […]

December 22, 2025

Memory Management in MongoDB 8.0: Testing the New TCMalloc

With MongoDB 8.0, the database engine takes another step forward in performance optimization, particularly in how it manages memory. One of the most impactful changes under the hood is the updated version of TCMalloc (Thread-Caching Malloc), which affects how the server allocates, caches, and reuses memory blocks. For workloads with high concurrency, long-running queries, or […]

December 21, 2025

JSONB vs. BSON: Tracing PostgreSQL and MongoDB Wire Protocols

There is an essential difference between MongoDB’s BSON and PostgreSQL’s JSONB. Both are binary JSON formats, but they serve different roles. JSONB is purely an internal storage format for JSON data in PostgreSQL. BSON, on the other hand, is MongoDB’s native data format: it is used by application drivers, over the network, in memory, and on disk.

JSONB: PostgreSQL internal storage format

JSONB is a storage format, as defined by the PostgreSQL documentation:

PostgreSQL offers two types for storing JSON data: json and jsonb

PostgreSQL uses JSONB solely for internal storage, requiring the entire structure to be read to access a field, as observed in JSONB DeTOASTing (read amplification).

BSON: MongoDB storage and exchange format

BSON is used for storage and also as an exchange format between the application and the database, as defined in the BSON specification:

BSON can be com­pared to bin­ary inter­change for­mats, like Proto­col Buf­fers. BSON is more "schema-less" than Proto­col Buf­fers

On the application side, the MongoDB driver converts application objects to BSON, which supports more data types than JSON or JSONB, including datetime and binary. This BSON is sent and received over the network and stored and manipulated on the server as-is, with no extra parsing. Both the driver and the database can efficiently access fields via the binary structure because BSON includes metadata such as field length prefixes and explicit type information, even for large or nested documents.

PostgreSQL protocol is JSON (text), not JSONB

To illustrate this, I've written a small Python program that inserts a document into a PostgreSQL table with a JSONB column, and queries that table to retrieve the document:

from sqlalchemy import Column, Integer, create_engine
from sqlalchemy.dialects.postgresql import JSONB
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()

class Item(Base):
    __tablename__ = 'items'
    id = Column(Integer, primary_key=True)
    data = Column(JSONB)  # our JSONB column

# Connect to Postgres
engine = create_engine('postgresql+psycopg2://', echo=True)
Session = sessionmaker(bind=engine)
session = Session()

# Create table
Base.metadata.create_all(engine)

# Insert an object into JSONB column
obj = {"name": "widget", "price": 9.99, "tags": ["new", "sale"]}
session.add(Item(data=obj))
session.commit()

# Read back the table
for row in session.query(Item).all():
    print(row.id, row.data)

The program uses SQLAlchemy to send and retrieve Python objects to and from PostgreSQL via the Psycopg2 driver. I've stored it in demo.py.

When I run the program, with python demo.py, before it displays the final result, it logs all SQL statements:

2025-12-21 12:50:22,484 INFO sqlalchemy.engine.Engine select pg_catalog.version()
2025-12-21 12:50:22,485 INFO sqlalchemy.engine.Engine [raw sql] {}
2025-12-21 12:50:22,486 INFO sqlalchemy.engine.Engine select current_schema()
2025-12-21 12:50:22,486 INFO sqlalchemy.engine.Engine [raw sql] {}
2025-12-21 12:50:22,486 INFO sqlalchemy.engine.Engine show standard_conforming_strings
2025-12-21 12:50:22,486 INFO sqlalchemy.engine.Engine [raw sql] {}
2025-12-21 12:50:22,487 INFO sqlalchemy.engine.Engine BEGIN (implicit)
2025-12-21 12:50:22,488 INFO sqlalchemy.engine.Engine select relname from pg_class c join pg_namespace n on n.oid=c.relnamespace where pg_catalog.pg_table_is_visible(c.oid) and relname=%(name)s
2025-12-21 12:50:22,488 INFO sqlalchemy.engine.Engine [generated in 0.00015s] {'name': 'items'}
2025-12-21 12:50:22,489 INFO sqlalchemy.engine.Engine
CREATE TABLE items (
        id SERIAL NOT NULL,
        data JSONB,
        PRIMARY KEY (id)
)


2025-12-21 12:50:22,489 INFO sqlalchemy.engine.Engine [no key 0.00011s] {}
2025-12-21 12:50:22,491 INFO sqlalchemy.engine.Engine COMMIT
2025-12-21 12:50:22,493 INFO sqlalchemy.engine.Engine BEGIN (implicit)
2025-12-21 12:50:22,494 INFO sqlalchemy.engine.Engine INSERT INTO items (data) VALUES (%(data)s) RETURNING items.id
2025-12-21 12:50:22,494 INFO sqlalchemy.engine.Engine [generated in 0.00018s] {'data': '{"name": "widget", "price": 9.99, "tags": ["new", "sale"]}'}
2025-12-21 12:50:22,495 INFO sqlalchemy.engine.Engine COMMIT
2025-12-21 12:50:22,497 INFO sqlalchemy.engine.Engine BEGIN (implicit)
2025-12-21 12:50:22,498 INFO sqlalchemy.engine.Engine SELECT items.id AS items_id, items.data AS items_data
FROM items
2025-12-21 12:50:22,498 INFO sqlalchemy.engine.Engine [generated in 0.00013s] {}

1 {'name': 'widget', 'tags': ['new', 'sale'], 'price': 9.99}

To see what is sent and received through the network by the PostgreSQL protocol, I run the program with strace, showing the sendto and recv system calls with their arguments: strace -e trace=sendto,recvfrom -yy -s 1000 python demo.py.

Like most SQL database drivers, the protocol is basic: send SQL commands as text, and fetch a tabular result set. In the PostgreSQL protocol's messages, the first letter is the message type (Q for Simple Query Message, followed by the length of the message, and the message in text, X to terminate the session, C for command completion status, T abd D for the resultset).

Here is the output, the lines starting with the timestamp are the logs from SQL Alchemy, those starting with sendto and recv are the network system calls with the message to the database, and the result from it

Where is the trace when inserting one document:

2025-12-21 16:52:20,278 INFO sqlalchemy.engine.Engine BEGIN (implicit)
2025-12-21 16:52:20,279 INFO sqlalchemy.engine.Engine INSERT INTO items (data) VALUES (%(data)s) RETURNING items.id
2025-12-21 16:52:20,279 INFO sqlalchemy.engine.Engine [generated in 0.00029s] {'data': '{"name": "widget", "price": 9.99, "tags": ["new", "sale"]}'}

sendto(3<TCPv6:[[::1]:41858->[::1]:5432]>, "Q\0\0\0\nBEGIN\0", 11, MSG_NOSIGNAL, NULL, 0) = 11

recvfrom(3<TCPv6:[[::1]:41858->[::1]:5432]>, "C\0\0\0\nBEGIN\0Z\0\0\0\5T", 16384, 0, NULL, NULL) = 17

sendto(3<TCPv6:[[::1]:41858->[::1]:5432]>, "Q\0\0\0vINSERT INTO items (data) VALUES ('{\"name\": \"widget\", \"price\": 9.99, \"tags\": [\"new\", \"sale\"]}') RETURNING items.id\0", 119, MSG_NOSIGNAL, NULL, 0) = 119

recvfrom(3<TCPv6:[[::1]:41858->[::1]:5432]>, "T\0\0\0\33\0\1id\0\0\0@\310\0\1\0\0\0\27\0\4\377\377\377\377\0\0D\0\0\0\v\0\1\0\0\0\0011C\0\0\0\17INSERT 0 1\0Z\0\0\0\5T", 16384, 0, NULL, NULL) = 62
2025-12-21 16:52:20,281 INFO sqlalchemy.engine.Engine COMMIT

sendto(3<TCPv6:[[::1]:41858->[::1]:5432]>, "Q\0\0\0\vCOMMIT\0", 12, MSG_NOSIGNAL, NULL, 0) = 12

recvfrom(3<TCPv6:[[::1]:41858->[::1]:5432]>, "C\0\0\0\vCOMMIT\0Z\0\0\0\5I", 16384, 0, NULL, NULL) = 18

It started a transaction (Q\0\0\0\nBEGIN), received command completion (C\0\0\0\nBEGIN), then sent the full text of the INSERT command, including the JSON payload (Q\0\0\0vINSERT INTO items (data) VALUES ('{\"name\": \"widget\", \"price\": 9.99, \"tags\": [\"new\", \"sale\"]}). It subsequently received command completion (INSERT 0 1) and the returned ID (T\0\0\0\33\0\1id, D\0\0\0\v\0\1\0\0\0\001).

Here is the trace when I query and fetch the document:

2025-12-21 16:52:20,283 INFO sqlalchemy.engine.Engine BEGIN (implicit)
2025-12-21 16:52:20,285 INFO sqlalchemy.engine.Engine SELECT items.id AS items_id, items.data AS items_data
FROM items
2025-12-21 16:52:20,285 INFO sqlalchemy.engine.Engine [generated in 0.00024s] {}

sendto(3<TCPv6:[[::1]:41858->[::1]:5432]>, "Q\0\0\0\nBEGIN\0", 11, MSG_NOSIGNAL, NULL, 0) = 11

recvfrom(3<TCPv6:[[::1]:41858->[::1]:5432]>, "C\0\0\0\nBEGIN\0Z\0\0\0\5T", 16384, 0, NULL, NULL) = 17

sendto(3<TCPv6:[[::1]:41858->[::1]:5432]>, "Q\0\0\0FSELECT items.id AS items_id, items.data AS items_data \nFROM items\0", 71, MSG_NOSIGNAL, NULL, 0) = 71

recvfrom(3<TCPv6:[[::1]:41858->[::1]:5432]>, "T\0\0\0>\0\2items_id\0\0\0@\310\0\1\0\0\0\27\0\4\377\377\377\377\0\0items_data\0\0\0@\310\0\2\0\0\16\332\377\377\377\377\377\377\0\0D\0\0\0I\0\2\0\0\0\0011\0\0\0:{\"name\": \"widget\", \"tags\": [\"new\", \"sale\"], \"price\": 9.99}C\0\0\0\rSELECT 1\0Z\0\0\0\5T", 16384, 0, NULL, NULL) = 157

It started another transaction, sent the SELECT statement as text and received the result as JSON text (D\0\0\0I\0\2\0\0\0\0011\0\0\0:{\"name\": \"widget\", \"tags\": [\"new\", \"sale\"], \"price\": 9.99}).

Finally, the transaction ends, and the sessionis disconnected:

sendto(3<TCPv6:[[::1]:41858->[::1]:5432]>, "Q\0\0\0\rROLLBACK\0", 14, MSG_NOSIGNAL, NULL, 0) = 14

recvfrom(3<TCPv6:[[::1]:41858->[::1]:5432]>, "C\0\0\0\rROLLBACK\0Z\0\0\0\5I", 16384, 0, NULL, NULL) = 20

sendto(3<TCPv6:[[::1]:41858->[::1]:5432]>, "X\0\0\0\4", 5, MSG_NOSIGNAL, NULL, 0) = 5

If you want to dig into the code, the server-side parsing is in jsonb_send and jsonb_recv ("The type is sent as text in binary mode"), and while it tests the version before converting to text, there's only one version. The client-side for Psycopg2 shows that register_default_jsonb is the same as register_default_json

Comparing with MongoDB

To compare with MongoDB, created the following demo-mongodb.py:

from pymongo import MongoClient
client = MongoClient("mongodb://127.0.0.1:27017")
db = client.my_database
insert_result = db.items.insert_one({"name": "widget", "price": 9.99, "tags": ["new", "sale"]})
print("Inserted document ID:", insert_result.inserted_id)
for doc in items_collection.find():
    print(doc["_id"], doc)

I used the same strace command, but displaying all characters as hexadecimal to be able to decode them with bsondump:

$ strace -e trace=sendto,recvfrom -xx -yy -s 1000 python demo-mongodb.py 2>&1

Here is the network request for the insert:


sendto(5<TCP:[127.0.0.1:44570->127.0.0.1:27017]>, "\xd6\x00\x00\x00\x51\xdc\xb0\x74\x00\x00\x00\x00\xdd\x07\x00\x00\x00\x00\x00\x00\x00\x5a\x00\x00\x00\x02\x69\x6e\x73\x65\x72\x74\x00\x06\x00\x00\x00\x69\x74\x65\x6d\x73\x00\x08\x6f\x72\x64\x65\x72\x65\x64\x00\x01\x03\x6c\x73\x69\x64\x00\x1e\x00\x00\x00\x05\x69\x64\x00\x10\x00\x00\x00\x04\x31\xb8\x9a\x81\xfd\x35\x42\x1a\x88\x44\xa8\x69\xe8\xba\x6f\x30\x00\x02\x24\x64\x62\x00\x0c\x00\x00\x00\x6d\x79\x5f\x64\x61\x74\x61\x62\x61\x73\x65\x00\x00\x01\x66\x00\x00\x00\x64\x6f\x63\x75\x6d\x65\x6e\x74\x73\x00\x58\x00\x00\x00\x07\x5f\x69\x64\x00\x69\x48\x3f\x7f\x87\x46\xd5\x2e\xe2\x0b\xbc\x0b\x02\x6e\x61\x6d\x65\x00\x07\x00\x00\x00\x77\x69\x64\x67\x65\x74\x00\x01\x70\x72\x69\x63\x65\x00\x7b\x14\xae\x47\xe1\xfa\x23\x40\x04\x74\x61\x67\x73\x00\x1c\x00\x00\x00\x02\x30\x00\x04\x00\x00\x00\x6e\x65\x77\x00\x02\x31\x00\x05\x00\x00\x00\x73\x61\x6c\x65\x00\x00\x00", 214, 0, NULL, 0) = 214

recvfrom(5<TCP:[127.0.0.1:44570->127.0.0.1:27017]>, "\x2d\x00\x00\x00\x06\x00\x00\x00\x51\xdc\xb0\x74\xdd\x07\x00\x00", 16, 0, NULL, NULL)<... (truncated)
                                    

December 20, 2025

IO-bound sysbench vs MySQL on a 48-core server

This has results for an IO-bound sysbench benchmark on a 48-core server for MySQL versions 5.6 through 9.5. Results from a CPU-bound sysbench benchmark on the 48-core server are here.

tl;dr

  • the regressions here on read-only tests are smaller than on the CPU bound workload, but when they occur are from new CPU overheads
  • the large improvements here on write-heavy tests are similar to the CPU bound workload

Builds, configuration and hardware

I compiled MySQL from source for versions 5.6.51, 5.7.44, 8.0.43, 8.0.44, 8.4.6, 8.4.7, 9.4.0 and 9.5.0.

The server is:
  • an ax162s from Hetzner with an AMD EPYC 9454P 48-Core Processor with SMT disabled
  • 2 Intel D7-P5520 NVMe storage devices with RAID 1 (3.8T each) using ext4
  • 128G RAM
  • Ubuntu 22.04 running the non-HWE kernel (5.5.0-118-generic)
The config files are here: 5.6.515.7.448.0.4x8.4.x9.x.0

Benchmark

I used sysbench and my usage is explained here. I now run 32 of the 42 microbenchmarks listed in that blog post. Most test only one type of SQL statement. Benchmarks are run with the database cached by InnoDB.

The read-heavy microbenchmarks are run for 600 seconds and the write-heavy for 900 seconds. The benchmark is run with 40 clients and 8 tables with 250M rows per table. With 250M rows per table this is IO-bound. I normally use 10M rows per table for CPU-bound workloads.

The purpose is to search for regressions from new CPU overhead, mutex contention and IO stress. 

Results

The microbenchmarks are split into 4 groups -- 1 for point queries, 2 for range queries, 1 for writes. For the range query microbenchmarks, part 1 has queries that don't do aggregation while part 2 has queries that do aggregation. 

I provide charts below with relative QPS. The relative QPS is the following:
(QPS for some version) / (QPS for MySQL 5.6.51)
When the relative QPS is > 1 then some version is faster than MySQL 5.6.51.  When it is < 1 then there might be a regression. When the relative QPS is 1.2 then some version is about 20% faster than MySQL 5.6.51.

Values from iostat and vmstat divided by QPS are hereThese can help to explain why something is faster or slower because it shows how much HW is used per request, including CPU overhead per operation (cpu/o) and context switches per operation (cs/o) which are often a proxy for mutex contention.

The spreadsheet and charts are here and in some cases are easier to read than the charts below.

Results: point queries

This has two charts. The y-axis is truncated on the second chart to improve readability for all tests but hot-points which is a positive outlier.

Summary:
  • the improvement for hot-points is similar to the CPU-bound results
  • the regressions here for the IO-bound tests are smaller than for the CPU-bound results
  • the regression in point-query is from new CPU overhead, see cpu/o here which is 1.37X larger in 9.5.0 vs 5.6.51
  • the regression in points-covered-si is from new CPU overhead, see cpu/o here which is 1.24X larger in 9.5.0 vs 5.6.51. This test is CPU-bound, the queries don't do IO because the secondary indexes are cached.
Results: range queries without aggregation

Summary:
  • the regressions here for the IO-bound tests are smaller than for the CPU-bound results, except for the scan test
  • the regressions in scan are from new CPU overhead, see cpu/o here, which is 1.38X larger in 9.5.0 vs 5.6.51
Results: range queries with aggregation

Summary:
  • the regressions here for the IO-bound tests are smaller than for the CPU-bound results
  • the regressions in read-only-count are from new CPU overhead, see cpu/o here, which is 1.25X larger in 9.5.0 vs 5.6.51
Results: writes

Summary:
  • the improvements here for the IO-bound tests are similar to the CPU-bound results
  • the largest improvement, for the update-index test, is from using less CPU, fewer context switches, less read IO and less write IO per operation -- see cpu/o, cs/o, rKB/o and wKB/o here

December 19, 2025

Atlas Search score details (the BM25 calculation)

With @james_blackwoodsewell_58 we were comparing the BM25 text search scores between MongoDB Atlas (Lucene), ElasticSearch (Lucene) and ParadeDB (using Tantivy) which provide the same ordering but MongoDB Atlas shows constantly a lower score by a factor of 2.2:

Revisiting "Text Search With MongoDB (BM25 TF-IDF) and PostgreSQL"

Back in October Franck Pachot from MongoDB (love your work) wrote a post comparing text search in MongoDB and PostgreSQL (with both the built-in tsvector and ParadeDB's pg_search extension). I'm not going to recap his whole post, but basically Mongo seemed to behave exactly how it should returning B

linkedin.com

It was the occasion for me to look at the score details which gives the calculation details for the score.

Test case

I've built the same test case as in my previous blog:

db.articles.drop();
db.articles.deleteMany({});
db.articles.insertMany([
 { description : "🍏 🍌 🍊" },                // short, 1 🍏
 { description : "🍎 🍌 🍊" },                // short, 1 🍎
 { description : "🍎 🍌 🍊 🍎" },             // larger, 2 🍎
 { description : "🍎 🍌 🍊 🍊 🍊" },          // larger, 1 🍎
 { description : "🍎 🍌 🍊 🌴 🫐 🍈 🍇 🌰" },  // large, 1 🍎
 { description : "🍎 🍎 🍎 🍎 🍎 🍎" },       // large, 6 🍎
 { description : "🍎 🍌" },                 // very short, 1 🍎
 { description : "🍌 🍊 🌴 🫐 🍈 🍇 🌰 🍎" },  // large, 1 🍎
 { description : "🍎 🍎 🍌 🍌 🍌" },          // shorter, 2 🍎
]);
db.articles.createSearchIndex("default",
  { mappings: { dynamic: true } }
);

Score with details

I ran the same query, adding scoreDetails: true to the search stage, and scoreDetails: { $meta: "searchScoreDetails" } } to the projection stage:


db.articles.aggregate([
  {
    $search: {
      text: {  query: ["🍎", "🍏"],  path: "description"  },
      index: "default",
      scoreDetails: true
    }
  },
  {  $project: {  
        _id: 0,  description: 1,  
        score: { $meta: "searchScore" },  
        scoreDetails: { $meta: "searchScoreDetails" }  }  },
  { $sort: { score: -1 } }  ,
  { $limit: 1 }
])

Here is the result:

mdb> db.articles.aggregate([
...   {
...     $search: {
...       text: {  query: ["🍎", "🍏"],  path: "description"  },
...       index: "default",
...       scoreDetails: true
...     }
...   },
...   {  $project: {  _id: 0,  description: 1,  score: { $meta: "searchScore" },  scoreDetails: { $meta: "searchScoreDetails" }  }  },
...   { $sort: { score: -1 } }  ,
...   { $limit: 1 }
... ])
[
  {
    description: '🍏 🍌 🍊',
    score: 1.0242118835449219,
    scoreDetails: {
      value: 1.0242118835449219,
      description: 'sum of:',
      details: [
        {
          value: 1.0242118835449219,
          description: '$type:string/description:🍏 [BM25Similarity], result of:',
          details: [
            {
              value: 1.0242118835449219,
              description: 'score(freq=1.0), computed as boost * idf * tf from:',
              details: [
                {
                  value: 1.8971199989318848,
                  description: 'idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:',
                  details: [
                    {
                      value: 1,
                      description: 'n, number of documents containing term',
                      details: []
                    },
                    {
                      value: 9,
                      description: 'N, total number of documents with field',
                      details: []
                    }
                  ]
                },
                {
                  value: 0.5398772954940796,
                  description: 'tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:',
                  details: [
                    {
                      value: 1,
                      description: 'freq, occurrences of term within document',
                      details: []
                    },
                    {
                      value: 1.2000000476837158,
                      description: 'k1, term saturation parameter',
                      details: []
                    },
                    {
                      value: 0.75,
                      description: 'b, length normalization parameter',
                      details: []
                    },
                    {
                      value: 3,
                      description: 'dl, length of field',
                      details: []
                    },
                    {
                      value: 4.888888835906982,
                      description: 'avgdl, average length of field',
                      details: []
                    }
                  ]
                }
              ]
            }
          ]
        }
      ]
    }
  }
]

So all is there. Here is the scoring breakdown for "🍏 🍌 🍊", which produced a score of 1.0242118835449219.

IDF calculation (inverse document frequency)

Search result:

  • Number of documents containing the term: n = 1
  • Total number of documents with this field: N = 9
idf = log(1 + (N - n + 0.5) / (n + 0.5))
    = log(1 + (9 - 1 + 0.5) / (1 + 0.5))
    = log(6.666666666666667)`  
     1.8971199989318848

TF calculation (term frequency)

Parameters are the Lucene defaults:

  • Term saturation parameter: k1 = 1.2000000476837158
  • Length normalization parameter: b = 0.75

Document field statistics:

  • Average length of the field: avgdl = 44 / 9 ≈ 4.888888835906982
  • Occurrences of the term in this document: freq = 1
tf = freq / (freq + k1 * (1 - b + b * dl / avgdl)) 
   = 1 / (1 + 1.2000000476837158 × (0.25 + 0.75 × (3 / 4.888888835906982))) 
    0.5398772954940796

Final score

Parameter:

  • Boost: 1.0
score = boost × idf × tf 
      = 1.0 × 1.8971199989318848 × 0.5398772954940796 
       1.0242118835449219

That confirms that Atlas Search uses the same scoring as Lucene https://github.com/apache/lucene/blob/releases/lucene/10.3.2/lucene/core/src/java/org/apache/lucene/search/similarities/BM25Similarity.java#L183

What about ElasticSearch and Tantivy

Eight years ago, Lucene removed the (k1 + 1) factor in LUCENE-8563. For k1 = 1.2, this change reduces the score by a factor of 2.2 from that version onward. Tantivy and Elasticsearch apparently still use the old formula, while Atlas Search uses the updated one, which explains the observed differences in scoring.

Conclusion

MongoDB Atlas Search indexes use the same scoring mechanism as Lucene indexes. When you compare them with other text search engines that also use Lucene, you might notice a score difference by a factor of about 2.2. This doesn’t actually matter, because scores are used only for ordering results, and the relative ordering remains the same.

Text search scores can seem magical, but they are fully deterministic and based on open-source formulas. In MongoDB, you can include the score details option in a text search query to inspect all the parameters and formulas behind the score.

Improve Developer Velocity with Kubernetes Databases

Your company has invested heavily in agile development, microservices, and Kubernetes to move faster. Your app teams can spin up a new service in minutes. So why can it still take a week to get a database for it? The bottleneck has shifted. It’s no longer compute; it’s the database. Manual, ticket-based provisioning still dominates, […]

LeaseGuard: Raft Leases Done Right!

Many distributed systems have a leader-based consensus protocol at their heart. The protocol elects one server as the "leader" who receives all writes. The other servers are "followers", hot standbys who replicate the leader’s data changes. Paxos and Raft are the most famous leader-based consensus protocols.

These protocols ensure consistent state machine replication, but reads are still tricky. Imagine a new leader L1 is elected, while the previous leader L0 thinks it's still in charge. A client might write to L1, then read stale data from L0, violating Read Your Writes. How can we prevent stale reads? The original Raft paper recommended that the leader communicate with a majority of followers before each read, to confirm it's the real leader. This guarantees Read Your Writes but it's slow and expensive.

A leader lease is an agreement among a majority of servers that one server will be the only leader for a certain time. This means the leader can run queries without communicating with the followers, and still ensure Read Your Writes. The original description of Raft included a lease protocol that was inherited from the earlier Paxos, where followers refuse to vote for a new leader until the old leader's lease expires. This entangles leases and elections, and it delays recovery after a crash. Besides, lease protocols have never been specified in detail, for either Raft or Paxos. For all these reasons, many Raft implementations don't use leases at all, or their leases are buggy.

In the MongoDB Distributed Systems Research Group, we designed a simple lease protocol tailored for Raft, called LeaseGuard. Our main innovation is to rely on Raft-specific guarantees to design a simpler lease protocol that recovers faster from a leader crash. Here’s a preprint of our SIGMOD'26 paper. This is a joint blog post by A. Jesse Jiryu Davis and Murat Demirbas, published on both of our blogs.

A huge simplification: the log is the lease

In Raft, before the leader executes a write command, it wraps the command in a log entry which it appends to its log. Followers replicate the entry by appending it to their logs. Once an entry is in a majority of servers’ logs, it is committed. Raft’s Leader Completeness property guarantees that any newly elected leader has all committed entries from previous leaders. Raft enforces this during elections: a server votes only for a candidate whose log is at least as up to date as its own. (Paxos doesn’t have this property, so a new leader has to fetch entries from followers before it’s fully functional.)

When designing LeaseGuard, we used Leader Completeness to radically simplify the lease protocol. LeaseGuard does not use extra messages or variables for lease management, and does not interfere with voting or elections. 

In LeaseGuard, the log is the lease. Committing a log entry grants the leader a lease that lasts until a timeout expires. While the lease is valid, the leader can serve consistent reads locally. Because of Leader Completeness, any future leader is guaranteed to have that same entry in its log. When a new leader L1 is elected, it checks its own log for the previous leader L0's last entry, to infer how long to wait for L0's lease to expire.

In existing protocols, the log is not the lease: instead, the leader periodically sends a message to followers which says, "I still have the lease". But imagine a leader who couldn't execute writes or append to its log --perhaps it's overloaded, or its disk is full or faulty-- but still has enough juice to send lease-extension messages. This lame-duck leader could lock up the whole system. In LeaseGuard, a leader maintains its lease only if it can make progress; otherwise, the followers elect a new one.

We are excited by the simplicity of this Raft-specific lease protocol. (We were inspired by some prior work, especially this forum post from Archie Cobbs.) In LeaseGuard, there is no separate code path to establish the lease. We decouple leases from elections. The log is the single source of truth for both replication and leasing.

LeaseGuard makes leader failovers smoother and faster

Leases improve read consistency but can slow recovery after a leader crash. No matter how quickly the surviving servers elect a new leader, it has to wait for the old leader's lease to expire before it can read or write. The system stalls as long as 10 seconds in one of the Raft implementations we studied.

LeaseGuard improves the situation in two ways. First, deferred-commit writes. As soon as a new leader wins election, it starts accepting writes and replicating them to followers. It just defers marking any writes "committed" until the old lease expires. Without this optimization, writes enqueue at the new leader until the old lease expires; then there’s a thundering herd. With our optimization, the new leader keeps up with the write load even while it’s waiting.

Second, inherited lease reads. This is our biggest innovation, and it’s a bit complicated. Consider the situation where L1 was just elected, but L0 is alive and still thinks it’s in charge. Neither leader knows about the other. (Yes, this can really happen during a network partition.) Raft makes sure that L0 can't commit any more writes, but there’s a danger of it serving stale reads. The whole point of leases is to prevent that by blocking L1 from reading and writing until L0’s lease expires. But what if there was a way for both leaders to serve reads, and still guarantee Read Your Writes?


When L1 was elected, it already had all of L0’s committed log entries (Leader Completeness), and maybe some newer entries from L0 that aren’t committed yet. L1 knows it has every committed entry, but it doesn’t know the true commit frontier! We call these ambiguous entries the limbo region. For each query, L1 checks if the result is affected by any entries in the limbo region—if not, L1 just runs the query normally. Otherwise, it waits until the ambiguity is resolved.

Inherited lease reads require synchronized clocks with known error bounds, but the rest of the protocol only needs local timers with bounded drift. Our two optimizations preserve Read Your Writes and dramatically improve availability.

Transitions in the read/write availability of leaders with LeaseGuard. While the new leader waits for a lease, it can serve some consistent reads and accept writes. 



Here's the whole algorithm in pseudo-Python. For more details, please read the paper.

Tests and benchmarks

When we started this research, our main goal was to publish a detailed and correct specification, so Raft implementers everywhere could implement leases without bugs. We’re TLA+ fans so obviously we specified the algorithm in TLA+ and checked it guaranteed Read Your Writes and other correctness properties. We discovered our two optimizations while writing the TLA+ spec. The inherited lease reads optimization was especially surprising to us; we probably wouldn’t have realized it was possible if TLA+ wasn’t helping us think.

We implemented the algorithm in LogCabin, the C++ reference implementation of Raft. (For ease of exposition, we also provide an implementation in a Python simulator.) 

In the following experiment, we illustrate how LeaseGuard improves throughput and reduces time to recovery. We crash the leader 500 ms after the test begins. At the 1000 ms mark, a new leader is elected, and at 1500 ms, the old leader’s lease expires. We ran this experiment with LogCabin in five configurations:
  • Inconsistent: LogCabin running fast and loose, with no guarantee of Read Your Writes.
  • Quorum: The default Read Your Writes mechanism, where the leader talks to a majority of followers before running each query, is miserably slow—notice that its Y axis is one tenth as high as the other charts!
  • Lease: The “log is the lease” protocol with no optimizations. Its throughput is as high as “inconsistent”, but it has a long time to recovery after the old leader crashes.
  • Defer commit: The log is the lease, plus our write optimization—you can see that write throughput spikes off the chart at 1500 ms, because the leader has been processing writes while waiting for the lease. As soon as it gets the lease, it commits all the writes at once.
  • Inherit lease: LeaseGuard with all our optimizations. Read throughput recovers as soon as a new leader is elected, without waiting for the old lease to expire.

Conclusion

Until now, the absence of a detailed specification for Raft leases led to many flawed implementations: they often failed to guarantee consistent reads, had very low read throughput, or recovered slowly from a leader crash. With LeaseGuard now specified, implemented, and published, we hope it will be readily adopted to enable Raft systems to provide fast reads with strong consistency and recover quickly after a crash.

We learned yet again the value of TLA+ during this project. TLA+ is useful not just for checking the correctness of a completed design, but for revealing new insights while the design is in progress. Also, we got interested in reasoning about knowledge, also known as epistemic logic. In Raft, servers can look in their logs and know that other servers know certain facts. For example, if a leader has a committed entry, it knows any future leader knows about this entry, but it doesn’t know if a future leader knows the entry was committed. This is a different way for us to think about a distributed system: it’s not just a state machine, it’s a group of agents with limited knowledge. We’re curious about this way of thinking and plan to do more research.

December 18, 2025

Optimizing correlated subqueries in Amazon Aurora PostgreSQL

Correlated subqueries can cause performance challenges in Amazon Aurora PostgreSQL which can cause applications to experience reduced performance as data volumes grow. In this post, we explore the advanced optimization configurations available in Aurora PostgreSQL that can transform these performance challenges into efficient operations without requiring you to modify a single line of SQL code.

Introducing Percona Load Generator for MongoDB Clusters: The Benchmark Tool That Simulates Your Actual Application

If you have ever tuned a MongoDB cluster that passed every synthetic benchmark with flying colors, only to choke the moment real user traffic hit, you are not alone. For years, database administrators and developers have relied on a standard suite of tools to test MongoDB performance (YCSB, Sysbench, POCDriver and mgodatagen –  just to […]

December 17, 2025

Performance regressions in MySQL 8.4 and 9.x with sysbench

I have been claiming that I don't find significant performance regressions in MySQL 8.4 and 9.x when I use sysbench. I need to change that claim. There are regressions for write-heavy tests, they are larger for tests with more concurrency and larger when gtid support is enabled.

By gtid support is enabled I mean that these options are set to ON:

Both of these are ON by default in MySQL 9.5.0 and were OFF by default in earlier releases. I just learned about the performance impact from these and in future tests I will make probably repeat tests with them set to ON and OFF.

This blog post has results from the write-heavy tests with sysbench for MySQL 8.0, 8.4, 9.4 and 9.5 to explain my claims above.

tl;dr

  • Regressions are most likely and larger on the insert test
  • There are regressions for write-heavy workloads in MySQL 8.4 and 9.x
    • Throughput is typically 15% less in MySQL 9.5 than in 8.0 for tests with 16 clients on the 24-core/2-socket srever
    • Throughput is typically 5% less in MySQL 9.5 than 8.0 for tests with 40 clients on the 48-core server
  • The regressions are larger when gtid_mode and enforce_gtid_consistency are set to ON
    • Throughput is typically 5% to 10% less with the -gtid configs vs the -nogtid configs with 40 clients on the 48-core server. But this is less of an issue on other servers.
    • There are significant increases in CPU, context switch rates and KB written to storage for the -gtid configs relative to the same MySQL version using the -nogtid configs
  • Regressions might be larger for the insert and update-inlist tests because they have larger transactions relative to other write-heavy tests. Performance regressions are correlated with increases in CPU, context switches and KB written to storage per transaction.
What changed?

I use diff to compare the output from SHOW GLOBAL VARIABLES when I build new releases and from that it is obvious that the default value for gtid_mode and enforce_gtid_consistency changed in MySQL 9.5 but I didn't appreciate the impact from that change.

Builds, configuration and hardware

I compiled MySQL from source for versions 8.0.44, 8.4.6, 8.4.7, 9.4.0 and 9.5.0.

The versions that I tested are named:
  • 8.0.44-nogtid
    • MySQL 8.0.44 with gtid_mode and enforce_gtid_consistency =OFF
  • 8.0.44-gtid
    • MySQL 8.0.44 with gtid_mode and enforce_gtid_consistency =ON
  • 8.4.7-notid
    • MySQL 8.4.7 with gtid_mode and enforce_gtid_consistency =OFF
  • 8.4.7-gtid
    • MySQL 8.4.7 with gtid_mode and enforce_gtid_consistency =ON
  • 9.4.0-nogtid
    • MySQL 9.4.0 with gtid_mode and enforce_gtid_consistency =OFF
  • 9.4.0-gtid
    • MySQL 9.4.0 with gtid_mode and enforce_gtid_consistency =ON
  • 9.5.0-nogtid
    • MySQL 9.5.0 with gtid_mode and enforce_gtid_consistency =OFF
  • 9.5.0-gtid
    • MySQL 9.5.0 with gtid_mode and enforce_gtid_consistency =ON

The servers are:
  • 8-core
    • The server is an ASUS ExpertCenter PN53 with and AMD Ryzen 7 7735HS CPU, 8 cores, SMT disabled, 32G of RAM. Storage is one NVMe device for the database using ext-4 with discard enabled. The OS is Ubuntu 24.04.
    • my.cnf for the -nogtid configs are here for 8.0, 8.4, 9.4, 9.5
    • my.cnf for the -gtid configs are here for 8.0, 8.4, 9.4, 9.5
    • The benchmark is run with 1 thread, 1 table and 50M rows per table
  • 24-core
    • The server is a SuperMicro SuperWorkstation 7049A-T with 2 sockets, 12 cores/socket, 64G RAM, one m.2 SSD (2TB,  ext4 with discard enabled). The OS is Ubuntu 24.04. The CPUs are Intel Xeon Silver 4214R CPU @ 2.40GHz.
    • my.cnf for the -nogtid configs are here for 8.0, 8.4, 9.4, 9.5
    • my.cnf for the -gtid configs are here for 8.0, 8.4, 9.4, 9.5
    • The benchmark is run with 16 threads, 8 tables and 10M rows per table
  • 48-core
    • The server is ax162s from Hetzner with an AMD EPYC 9454P 48-Core Processor with SMT disabled and 128G of RAM. Storage is 2 Intel D7-P5520 NVMe devices with RAID 1 (3.8T each) using ext4. The OS is Ubuntu 22.04 running the non-HWE kernel (5.5.0-118-generic).
    • my.cnf for the -nogtid configs are here for 8.0, 8.4, 9.4, 9.5
    • my.cnf for the -gtid configs are here for 8.0, 8.4, 9.4, 9.5
    • The benchmark is run with 40 threads, 8 tables and 10M rows per table
Benchmark

I used sysbench and my usage is explained here. I now run 32 of the 42 microbenchmarks listed in that blog post. Most test only one type of SQL statement. Benchmarks are run with the database cached by InnoDB. While I ran all of the tests, I only share results from a subset of the write-heavy tests.

The read-heavy microbenchmarks are run for 600 seconds and the write-heavy for 900 seconds. 

The purpose is to search for regressions from new CPU overhead and mutex contention. The workload is cached -- there should be no read IO but will be some write IO.

Results

The microbenchmarks are split into 4 groups -- 1 for point queries, 2 for range queries, 1 for writes. Here I only share results from a subset of the write-heavy tests.

I provide charts below with relative QPS. The relative QPS is the following:
(QPS for some version) / (QPS for MySQL 8.0.44)
When the relative QPS is > 1 then some version is faster than MySQL 8.0.44.  When it is < 1 then there might be a regression. When the relative QPS is 1.2 then some version is about 20% faster than MySQL 8.0.44.

Values from iostat and vmstat divided by QPS are here for the 8-core, 24-core and 48-core servers. These can help to explain why something is faster or slower because it shows how much HW is used per request, including CPU overhead per operation (cpu/o) and context switches per operation (cs/o) which are often a proxy for mutex contention.

The spreadsheet and charts are here and in some cases are easier to read than the charts below. The y-axis doesn't start at 0 to improve readability.

Results: 8-core

Summary
  • For many tests there are small regressions from 8.0 to 8.4 and 8.4 to 9.x
  • There are small improvements (~5%) for the -gtid configs vs the -nogtid result for update-index
  • There is a small regression (~5%) for the -gtid configs vs the -nogtid result for insert
  • There are small regression (~1%) for the -gtid configs vs the -nogtid result for other tests
From vmstat metrics for the insert test where perf decreases with the 9.5.0-gtid result
  • CPU per operation (cpu/o) increases by 1.10X with the -gtid config
  • Context switches per operation (cs/o) increases by 1.45X with the -gtid config
  • KB written to storage per commit (wKB/o) increases by 1.16X with the -gtid config
From vmstat metrics for the update-index test where perf increases with the 9.5.0-gtid result
  • CPU per operation (cpu/o) decreases by ~3% with the -gtid config
  • Context switches per operation (cs/o) decrease by ~2% with the -gtid config
  • KB written to storage per commit (wKB/o) decreases by ~3% with the -gtid config
  • This result is odd. I might try to reproduce it in the future
Results: 24-core

Summary
  • For many tests there are regressions from 8.0 to 8.4 and 8.4 to 9.x and throughput is typically 15% less in 9.5.0 than 8.0.44
  • There are large regressions in 9.4 and 9.5 for update-inlist
  • There is usually a small regression (~5%) for the -gtid configs vs the -nogtid result
From vmstat metrics for the insert test comparing 9.5.0-gtid with 9.5.0-nogtid
  • Throughput is 1.15X larger in 9.5.0-nogtid
  • CPU per operation (cpu/o) is 1.15X larger in 9.5.0-gtid
  • Context switches per operation (cs/o) are 1.23X larger in 9.5.0-gtid
  • KB written to storage per commit (wKB/o) is 1.24X larger in 9.5.0-gtid
From vmstat metrics for the update-inlist comparing both 9.5.0-nogtid and 9.5.0-nogtid with 8.0.44-nogtid
  • The problems here look different than most other tests as the regressions in 9.4 and 9.5 are similar for the -gtid and -nogtid configs. If I have time I will get flamegraphs and PMP output. The server here has two sockets and can suffer more from false-sharing and real contention on cache lines.
  • Throughput is 1.43X larger in 8.0.44-nogtid
  • CPU per operation (cpu/o) is 1.05X larger in 8.0.44-nogtid
  • Context switches per operation (cs/o) are 1.18X larger in 8.0.44-nogtid
  • KB written to storage per commit (wKB/o) is ~1.12X larger in 9.5.0
Results: 48-core

Summary
  • For many tests there are regressions from 8.0 to 8.4
  • For some tests there are regressions from 8.4 to 9.x
  • There is usually a large regression for the -gtid configs vs the -nogtid result and the worst case occurs on the insert test
From vmstat metrics for the insert test comparing 9.5.0-gtid with 9.5.0-nogtid
  • Throughput is 1.17X larger in 9.5.0-nogtid
  • CPU per operation (cpu/o) is 1.13X larger in 9.5.0-gtid
  • Context switches per operation (cs/o) are 1.26X larger in 9.5.0-gtid
  • KB written to storage per commit (wKB/o) is 1.24X larger in 9.5.0-gtid

Bridging the Gap: Querying PostgreSQL Tables from an Oracle Database

Database interoperability is a common requirement in enterprise environments. For users seeking to access Oracle data from a PostgreSQL instance, the native PostgreSQL extension oracle_fdw (Foreign Data Wrapper) is an excellent and efficient solution. However, the question often arises: How do we perform the reverse? How can an Oracle SQL query execute a SELECT statement […]

The $sql aggregation stage in Oracle Database to replace explain("executionStats")

As application-driven data models and document databases—made popular by MongoDB—continue to gain traction, Oracle Database has added MongoDB emulation capabilities on top of its SQL query engine. It's only the logical model and exposed API that resemble it, as physical documents are stored in relational tables and fixed-size blocks. This adds another abstraction on top of SQL, and when performance is different from what you expected, you need to look at the physical execution behind the logical query. The $sql stage of an aggregation pipeline can help troubleshoot. Let's take an example.

I create a one-million-document collection and index it with the ESR (Equality, Sort, Range) Guideline in mind:

// create the collection
db.oneMillion.drop();
db.oneMillion.createIndex({ e: 1, s: 1, r: 1 });
// insert documents
for (let i = 0; i < 1e2; i++) {  
  void db.oneMillion.insertMany( Array.from(  
    { length: 1e4 },  
    (_, i) => ({ e: i%3, s: new Date(), r: Math.random(), x: UUID() })  
  ) )
}
// check count
db.oneMillion.countDocuments();

I run a query that completes in 1 millisecond on MongoDB but takes seconds in the Oracle Database emulation. It’s a simple pagination query combining:

  • An equality filter: { e: 2 }, which returns one-third of the collection
  • A range filter: { r: { $gt: 0.5 } }, which returns half of those documents
  • A sort with pagination: sort({ s: 1, r: 1 }).limit(10), which returns the last ten documents by date and value

This query runs much faster on MongoDB than in the Oracle emulation. To get execution statistics, I add hint({"$native":"MONITOR"}) so that the underlying SQL query is run with the /*+ MONITOR */ hint:

db.oneMillion.find( 
 { e:2, r: { $gt:0.5 } },
 { s:1, r:1, _id:0 }
).sort({ s:1, r:1 }).limit(10).hint({"$native":"MONITOR"});

The query is executed and returns the result:

[
  { s: ISODate('2025-12-17T08:52:50.475Z'), r: 0.5222276191239983 },
  { s: ISODate('2025-12-17T08:52:50.475Z'), r: 0.7565247894880116 },
  { s: ISODate('2025-12-17T08:52:50.476Z'), r: 0.6099160713187135 },
  { s: ISODate('2025-12-17T08:52:50.476Z'), r: 0.765542699487576 },
  { s: ISODate('2025-12-17T08:52:50.476Z'), r: 0.8144790402364248 },
  { s: ISODate('2025-12-17T08:52:50.476Z'), r: 0.8328191789951023 },
  { s: ISODate('2025-12-17T08:52:50.476Z'), r: 0.8356551175440483 },
  { s: ISODate('2025-12-17T08:52:50.476Z'), r: 0.9779607167502489 },
  { s: ISODate('2025-12-17T08:52:50.477Z'), r: 0.5236033088481526 },
  { s: ISODate('2025-12-17T08:52:50.477Z'), r: 0.5290926931399482 }
]

After running it, I can get the SQL Monitor report for the last query in my session by calling the dbms_sqltune function through the $sql aggregation stage:

db.aggregate([{ $sql : `select 
 dbms_sqltune.report_sql_monitor(report_level=>'all',type=>'text') as "text"
`}]).forEach(row => print(row.text))

The output shows the underlying SQL query generated by the emulation, with its execution plan and execution statistics:

SQL Monitoring Report

SQL Text
------------------------------
select /*+ FIRST_ROWS(10) MONITOR */ json_patch("DATA",:1 project ),rawtohex("RESID"),"ETAG" from "ORA"."oneMillion" where JSON_EXISTS("DATA",'$?( (@.e.numberOnly() == $B0) && (@.r.numberOnly() > $B1) )' passing :2 as "B0", :3 as "B1" type(strict)) order by JSON_QUERY("DATA", '$.s[*].min()') asc nulls first, JSON_QUERY("DATA", '$.r[*].min()') asc nulls first fetch next 10 rows only

Global Information
------------------------------
 Status              :  DONE (ALL ROWS)
 Instance ID         :  4
 Session             :  ORA (36619:4028)
 SQL ID              :  420n5y2ytx6zh
 SQL Execution ID    :  67108867
 Execution Started   :  12/17/2025 09:04:51
 First Refresh Time  :  12/17/2025 09:04:51
 Last Refresh Time   :  12/17/2025 09:04:52
 Duration            :  1s
 Module/Action       :  ORDS_ADBS_Managed/-
 Service             :  CQWRIAXKGYBKVNX_O23_low.adb.oraclecloud.com
 Program             :  ORDS_ADBS_Managed
 Fetch Calls         :  1

Binds
========================================================================================================================
| Name | Position |  Type  |                                           Value                                           |
========================================================================================================================
| :2   |        2 | NUMBER | 2                                                                                         |
| :3   |        3 | NUMBER | .5                                                                                        |
========================================================================================================================

Global Stats
=================================================
| Elapsed |   Cpu   |  Other   | Fetch | Buffer |
| Time(s) | Time(s) | Waits(s) | Calls |  Gets  |
=================================================
|    1.18 |    1.18 |     0.00 |     1 |  98194 |
=================================================

SQL Plan Monitoring Details (Plan Hash Value=2462396944)
==============================================================================================================================================================================
| Id |                 Operation                 |            Name             |  Rows   | Cost |   Time    | Start  | Execs |   Rows   |  Mem  | Activity | Activity Detail |
|    |                                           |                             | (Estim) |      | Active(s) | Active |       | (Actual) | (Max) |   (%)    |   (# samples)   |
==============================================================================================================================================================================
|  0 | SELECT STATEMENT                          |                             |         |      |         1 |     +1 |     1 |       10 |     . |          |                 |
|  1 |   COUNT STOPKEY                           |                             |         |      |         1 |     +1 |     1 |       10 |     . |          |                 |
|  2 |    VIEW                                   |                             |      52 |  316 |         1 |     +1 |     1 |       10 |     . |          |                 |
|  3 |     SORT ORDER BY STOPKEY                 |                             |      52 |  316 |         2 |     +1 |     1 |       10 |  4096 |    50.00 | Cpu (1)         |
|  4 |      FILTER                               |                             |         |      |         1 |     +1 |     1 |     167K |     . |          |                 |
|  5 |       TABLE ACCESS BY INDEX ROWID BATCHED | oneMillion                  |      52 |   38 |         1 |     +1 |     1 |     167K |     . |    50.00 | Cpu (1)         |
|  6 |        HASH UNIQUE                        |                             |      52 |      |         1 |     +1 |     1 |     167K |     . |          |                 |
|  7 |         INDEX RANGE SCAN (MULTI VALUE)    | $ora:oneMillion.e_1_s_1_r_1 |      38 |   27 |         1 |     +1 |     1 |     167K |     . |          |                 |
==============================================================================================================================================================================

Returning those 10 rows used 1.18 seconds of CPU because it used the index only for the filters, returning 166,666 rows that had to be deduplicated (HASH UNIQUE), and sorted (SORT ORDER BY STOPKEY) before returning the result.

Oracle has powerful hints, and you can use them with the hint({"$native": }) (not to be confused with hint({"$natural":1})) of MongoDB. For example, I can try to avoid this HASH UNIQUE that doesn't preserve the ordering from the index:

db.oneMillion.find( 
 { e:2, r: { $gt:0.5 } },
 { s:1, r:1, _id:0 }
).sort({ s:1, r:1 }).limit(10).hint(
 {"$native":'NO_USE_HASH_AGGREGATION MONITOR'}
);

It uses a SORT UNIQUE but still doesn't preserve the index ordering because the deduplication is on the ROWID, so finally it's just an additional sort:

|  3 |     SORT ORDER BY STOPKEY                 |                             |       1 |    3 |         1 |     +1 |     1 |       10 |  4096 |          |                 |
|  4 |      FILTER                               |                             |         |      |         1 |     +1 |     1 |     167K |     . |          |                 |
|  5 |       TABLE ACCESS BY INDEX ROWID BATCHED | oneMillion                  |       1 |    2 |         1 |     +1 |     1 |     167K |     . |   100.00 | Cpu (1)         |
|  6 |        SORT UNIQUE                        |                             |       1 |      |         1 |     +1 |     1 |     167K |  10MB |          |                 |
|  7 |         INDEX RANGE SCAN (MULTI VALUE)    | $ora:oneMillion.e_1_s_1_r_1 |       1 |    2 |         1 |     +1 |     1 |     167K |     . |          |                 |

If you don't have the license for all options (Enterprise Edition, Diagnostic Pack, and Tuning Pack), don't use SQL Monitor. You can still view the execution plan with DBMS_XPLAN. To obtain execution statistics, use the GATHER_PLAN_STATISTICS hint:

db.oneMillion.find( 
 { e:2, r: { $gt:0.5 } },
 { s:1, r:1, _id:0 }
).sort({ s:1, r:1 }).limit(10).hint({"$native":"GATHER_PLAN_STATISTICS"});

The query to get all execution plan sections is:

db.aggregate( [ { $sql : `
 select * 
 from dbms_xplan.display_cursor(null,null,'ADVANCED ALLSTATS LAST')
` } ] ).forEach(row => print(row.PLAN_TABLE_OUTPUT));
SQL_ID  66jyw5hxfx4zh, child number 0
-------------------------------------
select /*+ FIRST_ROWS(10) GATHER_PLAN_STATISTICS */
json_patch("DATA",:1 project ),rawtohex("RESID"),"ETAG" from
"ORA"."oneMillion" where JSON_EXISTS("DATA",'$?( (@.e.numberOnly() ==
$B0) && (@.r.numberOnly() > $B1) )' passing :2 as "B0", :3 as "B1"
type(strict)) order by JSON_QUERY("DATA", '$.s[*].min()') asc nulls
first, JSON_QUERY("DATA", '$.r[*].min()') asc nulls first fetch next 10
rows only

Plan hash value: 2462396944

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                               | Name                        | Starts | E-Rows |E-Bytes| Cost (%CPU)| E-Time   | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                        |                             |      1 |        |       |     3 (100)|          |     10 |00:00:00.98 |   98194 |       |       |          |
|*  1 |  COUNT STOPKEY                          |                             |      1 |        |       |            |          |     10 |00:00:00.98 |   98194 |       |       |          |
|   2 |   VIEW                                  |                             |      1 |      1 | 18314 |     3  (34)| 00:00:01 |     10 |00:00:00.98 |   98194 |       |       |          |
|*  3 |    SORT ORDER BY STOPKEY                |                             |      1 |      1 |   179 |     3  (34)| 00:00:01 |     10 |00:00:00.98 |   98194 |  4096 |  4096 | 4096  (0)|
|*  4 |     FILTER                              |                             |      1 |        |       |            |          |    166K|00:00:00.32 |   98194 |       |       |          |
|   5 |      TABLE ACCESS BY INDEX ROWID BATCHED| oneMillion                  |      1 |      1 |   179 |     2   (0)| 00:00:01 |    166K|00:00:00.30 |   98194 |       |       |          |
|   6 |       HASH UNIQUE                       |                             |      1 |      1 |   179 |            |          |    166K|00:00:00.11 |    2048 |   772K|   772K|          |
|*  7 |        INDEX RANGE SCAN (MULTI VALUE)   | $ora:oneMillion.e_1_s_1_r_1 |      1 |      1 |       |     2   (0)| 00:00:01 |    166K|00:00:00.08 |    2048 |       |       |          |
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Query Block Name / Object Alias (identified by operation id):
-------------------------------------------------------------

   1 - SEL$2
   2 - SEL$1 / "from$_subquery$_002"@"SEL$2"
   3 - SEL$1
   5 - SEL$1 / "oneMillion"@"SEL$1"
   7 - SEL$1 / "oneMillion"@"SEL$1"

Outline Data
-------------

  /*+
      BEGIN_OUTLINE_DATA
      IGNORE_OPTIM_EMBEDDED_HINTS
      OPTIMIZER_FEATURES_ENABLE('23.1.0')
      DB_VERSION('23.1.0')
      OPT_PARAM('_fix_control' '20648883:0 26552730:1 27175987:0 29972495:0 22387320:0 30195773:0 31945701:1 32108311:1 33659818:3 34092979:1 35495824:1 33792497:1 36554842:1 36283175:1
              31720959:1 36004220:1 36635255:1 36675198:1 36868551:1 37400112:1 37346200:0 37626161:1')
      FIRST_ROWS(10)
      FORCE_XML_QUERY_REWRITE
      FORCE_JSON_TABLE_TRANSFORM
      XML_DML_RWT_STMT
      XMLINDEX_REWRITE
      XMLINDEX_REWRITE_IN_SELECT
      NO_COST_XML_QUERY_REWRITE
      OUTLINE_LEAF(@"SEL$1")
      OUTLINE_LEAF(@"SEL$2")
      NO_ACCESS(@"SEL$2" "from$_subquery$_002"@"SEL$2")
      INDEX_RS_ASC(@"SEL$1" "oneMillion"@"SEL$1" "$ora:oneMillion.e_1_s_1_r_1")
      BATCH_TABLE_ACCESS_BY_ROWID(@"SEL$1" "oneMillion"@"SEL$1")
      USE_HASH_AGGREGATION(@"SEL$1" UNIQUE)
      END_OUTLINE_DATA
  */

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter(ROWNUM<=10)
   3 - filter(ROWNUM<=10)
   4 - filter(HEXTORAW('04')>SYS_CONS_ANY_SCALAR(:3, 3))
   7 - access("oneMillion"."SYS_NC00005$"=SYS_CONS_ANY_SCALAR(:2, 3) AND "oneMillion"."SYS_NC00007$">SYS_CONS_ANY_SCALAR(:3, 3) AND "oneMillion"."SYS_NC00007$"<HEXTORAW('04'))
       filter(("oneMillion"."SYS_NC00007$"<HEXTORAW('04') AND "oneMillion"."SYS_NC00007$">SYS_CONS_ANY_SCALAR(:3, 3)))

Column Projection Information (identified by operation id):
-----------------------------------------------------------

   1 - "from$_subquery$_002"."JSON_PATCH("DATA",:1PROJECT)"[JSON,32600], "from$_subquery$_002"."RAWTOHEX("RESID")"[VARCHAR2,4000], "from$_subquery$_002"."ETAG"[RAW,16]
   2 - "from$_subquery$_002"."JSON_PATCH("DATA",:1PROJECT)"[JSON,32600], "from$_subquery$_002"."RAWTOHEX("RESID")"[VARCHAR2,4000], "from$_subquery$_002"."ETAG"[RAW,16]
   3 - (#keys=2) JSON_VALUE( /*+ QJSNMD5_TC_JCMP_JV */ JSON_QUERY("DATA" /*+ LOB_BY_VALUE */  FORMAT OSON , '$.s[*].min()' RETURNING JSON WITHOUT ARRAY WRAPPER NULL ON ERROR
       TYPE(LAX) ) FORMAT OSON , '$' RETURNING ANY ORA_RAWCOMPARE(32767) ERROR ON ERROR TYPE(LAX) )[32767], JSON_VALUE( /*+ QJSNMD5_TC_JCMP_JV */ JSON_QUERY("DATA" /*+ LOB_BY_VALUE */
       FORMAT OSON , '$.r[*].min()' RETURNING JSON WITHOUT ARRAY WRAPPER NULL ON ERROR TYPE(LAX