a curated list of database news from authoritative sources

December 27, 2024

Advent of Code 2024 in pure SQL

On a whim I decided to do this year's Advent of Code in pure SQL. That was an interesting experience that I can recommend to everybody, because it forces you to think differently about the problems. And I can report that it was possible to solve every problem in pure SQL.

In many cases SQL was actually surprisingly pleasant to use. The full solution for day 10 (including the puzzle input) is shown below:


 1  with recursive aoc10_input(i) as (select '
 2  89010123
 3  78121874
 4  87430965
 5  96549874
 6  45678903
 7  32019012
 8  01329801
 9  10456732
10  '),
11  lines(y,line) as (
12     select 0, substr(i,1,position(E'\n' in i)-1), substr(i,position(E'\n' in i)+1)
13     from aoc10_input
14     union all
15     select y+1,substr(r,1,position(E'\n' in r)-1), substr(r,position(E'\n' in r)+1)
16     from lines l(y,l,r) where position(E'\n' in r)>0
17  ),
18  field(x,y,v) as (
19     select x,y,ascii(substr(line,x::integer,1))-48
20     from (select * from lines l where line<>'') s, lateral generate_series(1,length(line)) g(x)
21  ),
22  paths(x,y,v,sx,sy) as (
23     select x,y,9,x,y from field where v = 9
24     union all
25     select f.x,f.y,f.v,p.sx,p.sy
26     from field f, paths p
27     where f.v=p.v-1 and ((f.x=p.x and abs(f.y-p.y)=1) or (f.y=p.y and abs(f.x-p.x)=1)) and p.v>0),
28  results as (select * from paths where v=0),
29  part1 as (select distinct * from results)
30  select (select count(*) from part1)  as part1, (select count(*) from results) as part2

Parsing the input is a bit painful in SQL, but it is not too bad. Lines 1-10 are simply the puzzle input, lines 11-17 split the input into individual lines, and lines 18-21 construct a 2D array from the input. The algorithm itself is pretty short: lines 22-27 perform a recursive traversal of the field, and lines 28-30 extract the puzzle answer from the traversal results. For this kind of small-scale traversal SQL works just fine.

Other days were more painful. Day 16, for example, conceptually does a very similar traversal of a field, computing the minimal traversal distance for each visited cell. Expressing that in SQL is easy, but evaluation is wasteful. When replacing the reference input with a real puzzle input the field is quite large, and the recursive query generates and preserves a lot of state, even though we only care about the last iteration of the recursive query. As a consequence you need a machine with over 200GB of memory to execute that query, even though most of the computed tuples are irrelevant. We could fix that excessive memory consumption by using iteration semantics during recursion, but that is not widely supported by DBMSes. Umbra can do it, but Postgres and DuckDB cannot, thus I have not used it in my solutions.

And sometimes the programming model of recursive SQL clashes with what we want to do. On day 23 we had to find the maximum clique in a sparse graph. This can be computed reasonably well with the Bron-Kerbosch algorithm, but expressing that in recursive SQL is quite convoluted because the algorithm wants to maintain multiple sets, while recursive SQL only passes a single set along. It can be done, but the result does not look pretty.
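
For reference, here is a minimal (non-SQL) sketch of Bron-Kerbosch in Python. The three sets R, P, and X that the recursion has to carry along are exactly the state that is awkward to thread through a recursive CTE.

def bron_kerbosch(R, P, X, adj, cliques):
    """Collect all maximal cliques; adj maps a vertex to its neighbor set."""
    if not P and not X:
        cliques.append(R)          # R is a maximal clique
        return
    for v in list(P):
        bron_kerbosch(R | {v}, P & adj[v], X & adj[v], adj, cliques)
        P = P - {v}                # v has been fully explored
        X = X | {v}

# Usage: adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}, 3: set()}
# cliques = []; bron_kerbosch(set(), set(adj), set(), adj, cliques)
# max(cliques, key=len) -> {0, 1, 2}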

This experiment has shown me two things: 1) it is possible to code quite complex algorithms in SQL, and often the SQL code is surprisingly pleasant, and 2) recursive SQL would be much more efficient and more pleasant to use if we had mechanisms to update state. There is ongoing work on supporting more complex control flow in recursion via a trampoline mechanism, which is very useful, too, but we should definitely look into more complex state manipulation mechanisms. With just a bit of extra functionality SQL would be quite a solid choice for running complex algorithms directly inside a database.

Speedb vs RocksDB on a large server

I am happy to read about storage engines that claim to be faster than RocksDB. Sometimes the claims are true and might lead to ideas for making RocksDB better. I am wary about evaluating such claims because that takes a lot of time, and when the claim is bogus I am reluctant to blog about it because I don't want to punch down on a startup.

Here I share results from the RocksDB benchmark scripts to compare Speedb and RocksDB and I am happy to claim that Speedb does some things better than RocksDB.

tl;dr

  • RocksDB and Speedb have similar average throughput for ...
    • a cached database
    • an IO-bound database when using O_DIRECT
  • RocksDB 8.6+ is slower than Speedb for write-heavy workloads with an IO-bound database when O_DIRECT isn't used. 
    • This problem arrived in RocksDB 8.6 (see issue 12038) which introduces the use of the readahead system call to prefetch data that compaction will soon read. I am not sure what was used prior. The regression that arrived in 8.6 was partially fixed in release 9.9. 
  • In general, Speedb has better QoS (less throughput variance) for write-heavy workloads
Updates
  • Update 1 - with a minor hack I can remove about 1/3 of the regression between RocksDB 8.5 and 9.9 in the overwrite benchmark for the iobuf workload

On issue 12038

RocksDB 8.6 switched to using the readahead system call to prefetch SSTs that will soon be read by compaction. The goal is to reduce the time that compaction threads must wait for data. But what I see with iostat is that a readahead call is ignored when the value of the count argument is larger than max_sectors_kb for the storage device. And this happens on one or both of ext4 and xfs. I am not a kernel guru and I have yet to read this nice writeup of readahead internals. I do read this great note from Jens Axboe every few years.

I opened issue 12038 for this issue and it was fixed in RocksDB 9.9 by adding code that reduces the value of compaction_readahead_size to be <= the value of max_sectors_kb for the database's storage device. However the fix in 9.9 doesn't restore the performance that existed prior to the change (see 8.5 results). I assume the real fix is to have code in RocksDB to do the prefetches rather than rely on the readahead system call.

Hardware

The server is an ax162-s from Hetzner with an AMD EPYC 9454P processor, 48 cores, AMD SMT disabled and 128G RAM. The OS is Ubuntu 22.04. Storage is 2 NVMe devices with SW RAID 1 and ext4.

The values of max_sectors_kb and max_hw_sectors_kb are 128 (KB) for both the SW RAID device (md2) and the underlying storage devices (nvme0n1, nvme1n1).

Builds

I compiled db_bench from source. I used RocksDB versions 7.3.2, 7.10.2, 8.5.4, 8.6.7, 9.7.4 and 9.9.3. 

For Speedb I used the latest diff in their github repo which appears to be based on RocksDB 8.6.7, although I am confused that it doesn't suffer from issue 12038 which is in RocksDB 8.6.7.

commit 8d850b666cce6f39fbd4064e80b85f9690eaf385 (HEAD -> main, origin/main, origin/HEAD)
Author: udi-speedb <106253580+udi-speedb@users.noreply.github.com>
Date:   Mon Mar 11 14:00:03 2024 +0200

    Support Speedb's Paired Bloom Filter in db_bloom_filter_test (#810)

Benchmark

All tests used 2MB for compaction_readahead_size and the hyper clock block cache.

I used my fork of the RocksDB benchmark scripts that are wrappers to run db_bench. These run db_bench tests in a special sequence -- load in key order, read-only, do some overwrites, read-write and then write-only. The benchmark was run using 40 threads. How I do benchmarks for RocksDB is explained here and here. The command line to run the tests is: 
    bash x3.sh 40 no 1800 c48r128 100000000 2000000000 iobuf iobuf iodir

The tests on the charts are named as:
  • fillseq -- load in key order with the WAL disabled
  • revrangeww -- reverse range while writing, do short reverse range scans as fast as possible while another thread does writes (Put) at a fixed rate
  • fwdrangeww -- like revrangeww except do short forward range scans
  • readww -- like revrangeww except do point queries
  • overwrite -- do overwrites (Put) as fast as possible
For configuration options that Speedb and RocksDB have in common I set those options to the same values. I didn't experiment with Speedb-only options except for use_dynamic_delay.

Workloads

There are three workloads, all of which use 40 threads:

  • byrx - the database is cached by RocksDB (100M KV pairs)
  • iobuf - the database is larger than memory and RocksDB uses buffered IO (2B KV pairs)
  • iodir - the database is larger than memory and RocksDB uses O_DIRECT (2B KV pairs)

Spreadsheets with charts are here and here. Performance summaries are here for byrx, iobuf and iodir.

Results: average throughput

These charts plot relative QPS by test where relative QPS is (QPS for a given version / QPS for speedb.udd0). When this value is less than 1.0 the given version is slower than speedb.udd0; when it is greater than 1.0 the given version is faster than speedb.udd0. The base case is speedb.udd0, which is Speedb with use_dynamic_delay=0. The versions listed in the charts are:
  • speedb.udd1 - Speedb with use_dynamic_delay=1
  • rocksdb.7.3 - RocksDB 7.3.2
  • rocksdb.7.10 - RocksDB 7.10.2
  • rocksdb.8.5 - RocksDB 8.5.4
  • rocksdb.8.6 - RocksDB 8.6.7
  • rocksdb.9.7 - RocksDB 9.7.4
  • rocksdb.9.9 - RocksDB 9.9.3
For a cached workload (byrx):
  • RocksDB is faster at fillseq (load in key order), otherwise modern RocksDB and Speedb have similar average throughput
  • RocksDB 7.3 was much slower on the read while writing tests (revrangeww, fwdrangeww and readww) but that was fixed by 7.10 and might have been related to issue 9423 or it might be from improvements to the hyper clock cache.
For an IO-bound workload that doesn't use O_DIRECT (iobuf)
  • Modern RocksDB is faster than Speedb at fillseq, has similar perf on the read while writing tests and is slower on overwrite (write-only with keys in random order). The difference in overwrite perf is probably from issue 12038 which arrives in RocksDB 8.6 and then has a fix in 9.9.
  • Similar to above RocksDB 7.3 has a few perf problems that have since been fixed
For an IO-bound workload that uses O_DIRECT (iodir)
  • Modern RocksDB is much faster than Speedb at fillseq and then has similar perf on the other tests
  • Similar to above RocksDB 7.3 has a few perf problems that have since been fixed.

Results: throughput over time

The previous section shows average throughput. And while more average QPS is nice, if it comes with more variance then it is less than great. The charts in this section show QPS at 1-second intervals for fillseq and overwrite for two releases:
  • speedb.udd1 - Speedb with use_dynamic_delay=1
  • rocksdb.9.9.3 - RocksDB 9.9.3

fillseq (load in key order) for byrx, iobuf and iodir
  • For the cached workload the better perf from RocksDB is obvious
  • For the IO-bound workloads the results are closer
  • Variance is less with Speedb than with RocksDB based on the thickness of the lines
With overwrite for a cached workload (byrx) there are two charts - one from the entire run and one from the last 300 seconds.
  • Using the eyeball method (much hand waving) the variance is similar for RocksDB and Speedb. Both suffer from write stalls.
  • Response time percentiles (see p50, p99, p99.9, p99.99 and pmax here) show that pmax is <= 57 milliseconds for everything but RocksDB 7.3. Note the numbers in the gist are in usecs.
With overwrite for an IO-bound workload (iobuf) without O_DIRECT there are two charts - one from the entire run and one from the last 300 seconds.
  • Speedb has slightly better average throughput
  • Speedb has much better QoS (less variance). That is obvious based on the thickness of the red line vs the blue line. RocksDB also has more write stalls based on the number of times the blue lines drop to near zero.
  • Using the response time percentiles here the differences between Speedb and RocksDB are less obvious.
  • The percentage of time that writes are stalled is <= 9% for Speedb and >= 20% for modern RocksDB (see stall% here). This is a good way to measure QoS.
With overwrite for an IO-bound workload (iodir) with O_DIRECT there are two charts - one from the entire run and one from the last 300 seconds.
  • Average throughput is slightly better for RocksDB than for Speedb
  • QoS is slightly better for Speedb than for RocksDB (see the blue lines dropping to zero)
  • RocksDB does much better here with O_DIRECT than it does above without O_DIRECT
  • Speedb still does better than RocksDB at avoiding write stalls. See stall% here.
Sources of write stalls

The output from db_bench includes a summary of the sources of write stalls. However, that summary just shows the number of times each was invoked without telling you how much each contributes to the stall% (the total percentage of time that write stalls are in effect).

For the IO-bound workloads (iobuf and iodir) the number of write stalls is much lower with Speedb. It appears to be more clever about managing write throughput.

The summary for iobuf with Speedb (speedb.udd1) and RocksDB (9.9.3)

speedb  rocksdb
1429      285   cf-l0-file-count-limit-delays-with-ongoing-compaction
   0        0   cf-l0-file-count-limit-stops-with-ongoing-compaction
1429      285   l0-file-count-limit-delays
   0        0   l0-file-count-limit-stops
   0        0   memtable-limit-delays
   0        0   memtable-limit-stops
1131    12585   pending-compaction-bytes-delays
   0        0   pending-compaction-bytes-stops
2560    12870   total-delays
   0        0   total-stops

The summary for iodir with Speedb (speedb.udd1) and RocksDB (9.9.3)

speedb  rocksdb
   5      287   cf-l0-file-count-limit-delays-with-ongoing-compaction
   0        0   cf-l0-file-count-limit-stops-with-ongoing-compaction
   5      287   l0-file-count-limit-delays
   0        0   l0-file-count-limit-stops
  22        0   memtable-limit-delays
   0        0   memtable-limit-stops
   0      687   pending-compaction-bytes-delays
   0        0   pending-compaction-bytes-stops
  27      974   total-delays
   0        0   total-stops

Compaction efficiency

The final sample of compaction IO statistics at the end of the overwrite test is here for iobuf and iodir for speedb.udd1 and RocksDB 9.9.3.

For iobuf the wall clock time for which compaction threads are busy (the Comp(sec) column) is about 1.10X larger for RocksDB than Speedb. This is likely because Speedb is doing larger reads from storage (see the next section) so RocksDB has more IO waits. But the CPU time for compaction (the CompMergeCPU(sec) column) is about 1.12X larger for Speedb (I am not sure why).

For iodir the values in the Comp(sec) and CompMergeCPU(sec) columns are not as different across Speedb and RocksDB as they were above for iobuf.

Understanding IO via iostat

The numbers below are the average values from iostat collected during overwrite.

For iobuf the average read size (rareqsz) drops to 4.7KB in RocksDB 8.6 courtesy of issue 12038 while it was >= 50KB prior to RocksDB 8.6. The value improves to 34.2KB in RocksDB 9.9.3 but it is still much less than what it used to be in RocksDB 8.5.

For iodir the average read size (rareqsz) is ~112KB in all cases.

The RocksDB compaction threads read from the compaction input a RocksDB block at a time. For these tests I use an 8kb RocksDB block, but in the IO-bound tests the blocks are compressed for the larger levels of the LSM tree, and a compressed block is ~4kb. Thus some kind of prefetching that does large read requests is needed to improve read IO efficiency.

Legend:
  • rps - reads/s
  • rMBps - read MB/s
  • rawait - read wait latency in milliseconds
  • rareqsz - read request size in KB
  • wps - writes/s
  • wMBps - write MB/s
  • wawait - write wait latency in milliseconds
  • wareqsz - write request size in KB
iobuf (buffered IO)
rps     rMBps   rawait  rareqsz wps     wMBps   wawait  wareqsz
6336    364.4   0.43    52.9    14344   1342.0  0.40    95.5    speedb.udd0
6209    357.6   0.41    51.6    14177   1322.8  0.41    95.3    speedb.udd1
2133    164.0   0.19    26.1    8954    854.4   0.31    99.3    rocksdb.7.3
4542    333.5   0.49    71.8    14734   1361.6  0.39    94.5    rocksdb.7.10
6101    352.1   0.42    52.4    14878   1391.8  0.41    96.1    rocksdb.8.5
40471   184.7   0.10    4.7     8552    784.1   0.31    93.6    rocksdb.8.6
39201   178.8   0.10    4.7     8783    801.5   0.32    93.7    rocksdb.9.7
7733    268.7   0.29    34.2    12742   1156.0  0.30    93.2    rocksdb.9.9

iodir (O_DIRECT)
rps     rMBps   rawait  rareqsz wps     wMBps   wawait  wareqsz
12757   1415.9  0.82    112.8   16642   1532.6  0.40    93.5    speedb.udd0
12539   1392.9  0.83    112.7   16327   1507.9  0.41    93.6    speedb.udd1
8090    903.5   0.74    114.3   10036   976.6   0.30    100.5   rocksdb.7.3
12155   1346.2  0.90    112.8   15602   1462.4  0.40    95.7    rocksdb.7.10
13315   1484.5  0.85    113.6   17436   1607.1  0.40    94.0    rocksdb.8.5
12978   1444.3  0.84    112.9   16981   1563.0  0.46    93.4    rocksdb.8.6
12641   1411.9  0.84    113.9   16217   1535.6  0.43    96.7    rocksdb.9.7
12990   1450.7  0.83    113.8   16704   1576.0  0.41    96.3    rocksdb.9.9

Update 1

For the overwrite test and the iobuf (buffered IO, IO-bound) workload the QPS is 277795 for RocksDB 8.5.4 vs 227068 for 9.9.3. So 9.9.3 gets ~82% of the QPS relative to 8.5.4. Note that 9.8 only gets about 57% and the improvement from 9.8 to 9.9 is from fixing issue 12038, but that fix isn't sufficient given the gap between 8.5 and 9.9.

With a hack I can improve the QPS for 9.9 from 227068/s to 241260/s. The problem, explained to me by the RocksDB team, is that while this code adjusts compaction readahead to be no larger than max_sectors_kb, the code that requests prefetch can request for (X + Y) bytes where X is the adjusted compaction readahead amount and Y is the block size. The sum of these is likely to be larger than max_sectors_kb. So my hack was to reduce compaction readahead to be (max_sectors_kb - 8kb) given that I am using block_size=8kb.
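
As a hedged illustration of the sizing problem described above (the numbers come from this post, not from RocksDB source):

# Illustration only: the arithmetic behind the hack, using values from this post.
max_sectors_kb = 128             # /sys/block/.../queue/max_sectors_kb on this server
block_size_kb = 8                # RocksDB block_size used in these tests
configured_readahead_kb = 2048   # 2MB compaction_readahead_size

# The 9.9 fix clamps readahead to max_sectors_kb ...
clamped = min(configured_readahead_kb, max_sectors_kb)   # 128
# ... but a prefetch request can be readahead + one block, exceeding the limit:
print(clamped + block_size_kb)                           # 136 > 128 -> readahead ignored

# The hack: leave room for the extra block.
hacked = max_sectors_kb - block_size_kb                  # 120
print(hacked + block_size_kb)                            # 128 <= 128 -> readahead honored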

And with the hack the QPS for overwrite improves. The performance summary from the overwrite test is here and the stall% decreased from 25.9 to 22. Alas, the stall% was 9.0 with RocksDB 8.5.4 so there is still room for improvement.

The compaction reasons for 8.5.4, 9.9.3 without the hack (9.9orig) and 9.9.3 with the hack (9.9hack) provide a better idea of what has changed.

8.5.4   9.9orig 9.9hack
1089      285     211   cf-l0-file-count-limit-delays-with-ongoing-compaction
   0        0       0   cf-l0-file-count-limit-stops-with-ongoing-compaction
1089      285     211   l0-file-count-limit-delays
   0        0       0   l0-file-count-limit-stops
   0        0       0   memtable-limit-delays
   0        0       0   memtable-limit-stops
1207    12585   10735   pending-compaction-bytes-delays
   0        0       0   pending-compaction-bytes-stops
2296    12870   10946   total-delays
   0        0       0   total-stops

And finally, averages from iostat during the test show that 9.9.3 with the hack gets the largest average read request size (rareqsz is 56.4) but it still is slower than 8.5.4.

rps     rmbps   rawait  rareqsz wps     wmbps   wawait  wareqsz
6101    352.1   0.42    52.4    14878   1391.8  0.41    96.1    8.5.4
7733    268.7   0.29    34.2    12742   1156.0  0.30    93.2    9.9orig
5020    285.7   0.43    56.4    13467   1220.2  0.32    93.1    9.9hack

December 26, 2024

Use of Time in Distributed Databases (part 2): Use of logical clocks in databases

This is part 2 of our "Use of Time in Distributed Databases" series. We talk about the use of logical clocks in databases in this post. We consider three different approaches:

  • vector clocks
  • dependency graph maintenance
  • epoch service 

In the upcoming posts we will bring physical clocks into the picture for timestamping; there are (almost) no physical clocks involved in the systems covered in part 2.



1. Vector clocks

Dynamo: Amazon's highly available key-value store (SOSP'07)

Dynamo employs sloppy quorums and hinted handoff and uses version vectors (a special case of vector clocks) to track causal dependencies within the replication group of each key. A version vector contains one entry for each replica (thus the size of clocks grows linearly with the number of replicas). The purpose of this metadata is to detect conflicting updates and to be used in the conflict reconciliation function. Dynamo provides eventual consistency thanks to this reconciliation function and conflict detection by version vectors.
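
As a rough illustration (not Dynamo's actual code), conflict detection with version vectors boils down to checking whether one vector dominates the other; the replica names below are made up:

def dominates(a, b):
    """True if version vector a reflects every update that b does."""
    return all(a.get(r, 0) >= c for r, c in b.items())

def conflict(a, b):
    # Neither vector dominates the other: concurrent updates that must be
    # handed to the application's reconciliation function.
    return not dominates(a, b) and not dominates(b, a)

v1 = {"replicaA": 2, "replicaB": 1}
v2 = {"replicaA": 1, "replicaB": 2}
print(conflict(v1, v2))  # True -> reconcile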

Cassandra, which provided an open-source implementation of Dynamo, decided to forgo vector clocks in favor of using physical time supplied by the client and the Last-Writer-Wins rule for updating replicas.

So, yeah, somehow the use of vector clocks in datastores fizzled out over time. Maybe the size of vector clocks to be included in messages was the headache. Or maybe the use of synchronized physical clocks offered more advantages in addition to a single scalar timestamp. Nevertheless, vector clocks may still have applications in version control systems and event logging in distributed systems. And, below we talk about two more systems that use some form of vector clocks.


ORBE (SOCC'13): Causal consistency

ORBE uses vector clocks, organized as a matrix, to represent dependencies. The vector clock has an entry per partition and data center. Physical clocks are used for generating read snapshot times, and ORBE can complete read-only transactions in one round by relying on these loosely synchronized physical clocks. A drawback with ORBE is the large size of timestamps, which the follow-up work GentleRain aimed to address.


The end of a myth: Distributed transactions can scale (VLDB'17)

NAM-DB aims to address scalability challenges in distributed transactions through innovative use of RDMA (Remote Direct Memory Access), adopting a timestamp oracle design. The timestamp oracle uses a partitionable vector clock approach to manage commit timestamps without contention. The protocol implements a software-based solution where each transaction execution thread maintains its own counter in a timestamp vector, allowing for distributed timestamp management without contention. Transactions obtain read timestamps by reading the vector and commit timestamps by incrementing their specific vector entry through efficient RDMA operations.

Let's dive into how the commit protocol achieves consistency. When committing, transactions create new timestamps by incrementing their counter, verify and lock their write-sets using RDMA operations, and update the timestamp vector upon success. This design offers several advantages: transaction threads operate independently without synchronization overhead, long-running transactions don't block others, and the system maintains monotonic timestamp progression when stored on a single memory server (though this property may not hold with partitioned storage).
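
Here is a minimal sketch of the timestamp-vector idea, assuming an in-process list instead of NAM-DB's RDMA-resident vector (class and method names are illustrative):

class TimestampVectorOracle:
    """Sketch of a partitionable timestamp vector: one entry per transaction thread."""

    def __init__(self, num_threads):
        self.vector = [0] * num_threads

    def read_timestamp(self):
        # A read timestamp is a snapshot of the whole vector.
        return list(self.vector)

    def commit_timestamp(self, thread_id):
        # Each thread bumps only its own entry, so commits don't contend.
        self.vector[thread_id] += 1
        return (thread_id, self.vector[thread_id])

oracle = TimestampVectorOracle(num_threads=4)
rts = oracle.read_timestamp()        # e.g. [0, 0, 0, 0]
cts = oracle.commit_timestamp(2)     # (2, 1): thread 2's first commit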



2. Dependency graph maintenance

Scalable Causal Consistency for Wide-Area Storage with COPS (SOSP'11)

COPS introduced a dependency tracking approach for achieving causal consistency in geo-replicated datastores. The system assigns scalar version numbers to objects and maintains causality by having clients track the versions of all objects read in their causal past. When updates are propagated between data centers, they carry their dependencies, and receiving data centers only make updates visible once all dependencies are satisfied. A key feature of COPS is its support for causally consistent read-only transactions, which provide a consistent snapshot of the system. These transactions are implemented through a two-round protocol.
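
Sketched below is the COPS-style visibility rule in hypothetical Python; wait_until, version_of, and put are illustrative stand-ins, not the real API. A replicated write becomes visible only after all of its dependencies are locally satisfied.

import time

def wait_until(condition, poll_interval=0.01):
    # Hypothetical helper: block until condition() becomes true.
    while not condition():
        time.sleep(poll_interval)

def apply_replicated_write(local_store, key, version, value, deps):
    """Make a replicated write visible only after its dependencies are satisfied."""
    # deps: (dep_key, dep_version) pairs read in the writer's causal past.
    for dep_key, dep_version in deps:
        wait_until(lambda: local_store.version_of(dep_key) >= dep_version)
    local_store.put(key, value, version)   # now safe to make the update visible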

COPS chose to perform explicit dependency tracking over using vector clocks. They justified their choice against vector clocks by citing scalability concerns, particularly the O(N) size growth with the number of nodes. They argued that in a datacenter with thousands of nodes, the metadata overhead would become prohibitive. I think they overindexed on N, the number of nodes: N doesn't grow to very large numbers in deployments, and especially not for replication. As another reason, they noted that vector clocks only provide happens-before relationships and there would still be a need for additional mechanisms like serialization points or explicit dependency checking to enforce causal consistency across the datacenter. I don't get this argument, either. I think they wanted to take a stance for explicit dependency checking rather than the implicit/wholesale causality we get from logical/vector clocks.

This explicit dependency tracking approach influenced later systems, including the EPaxos family of consensus protocols. The principle is the same: each operation carries its dependencies, the dependencies are checked at each node during replication, and the value is updated there once they are satisfied. Unfortunately, the dependency graphs can grow significantly in pathological cases, and these systems can experience significant slowdowns when dependency lists grow large. Subsequent systems like Occult and Accord/Cassandra (as we will cover in upcoming posts in this series) have shown that combining the dependency tracking approach with loosely synchronized physical clocks can help manage the complexity.


Kronos: The design and implementation of an event ordering service (Eurosys'14)

Kronos introduces a centralized event ordering service for distributed systems that tracks happens-before relationships through a dedicated API. Rather than having individual nodes maintain and propagate dependency information as in COPS, here the applications explicitly register events and define causal relationships with the Kronos service. This approach allows for cross-system dependency management and fine-grained concurrency detection, with the system binding events to a time order as late as possible. While this provides more flexibility in capturing application-specific causality compared to Logical/Vector Clocks (which automatically assume causal dependence between consecutive events on the same node), it comes with the overhead of communicating with the service and searching dependency graphs.


 

3. Epoch server

Chardonnay (OSDI'23): use of a centralized epoch service

Chardonnay is an in-memory distributed database that employs a logically-centralized (3 MultiPaxos nodes under the trenchcoat) epoch service, whose sole job  is to maintain a monotonic epoch counter. The magic of the epoch counter enters the picture for read-only transactions, but let's first cover the read-write transactions.

For read-write transactions, Chardonnay uses a two-phase approach: first running transactions in "dry run" mode to discover and pin read/write sets in memory, then executing them definitively using 2PC+2PL in-memory for speed. This approach leverages modern datacenter networking being significantly faster than disk I/O, allowing Chardonnay to achieve strictly serializable transactions efficiently by keeping relevant data in memory and avoiding deadlocks through ordered lock acquisition. In that sense, this architecture builds on ideas from deterministic databases like Calvin.

For read-only transactions, Chardonnay implements snapshot isolation within epochs (10ms intervals), enabling contention-free queries. A transaction can get a consistent snapshot as of the beginning of the current epoch ec by ensuring it observes the effects of all committed transactions that have a lower epoch. That is realized by waiting for all the transactions with an epoch e < ec to release their write locks. Hence, the snapshot read algorithm would simply work by reading the epoch ec, then reading the appropriate key versions. It is a neat trick, no?

This algorithm does not guarantee strict serializability, because a transaction T would not observe the effects of transactions in epoch ec that committed before T started. If desired, ensuring linearizability is easy at the cost of some latency: after T starts, wait for the epoch to advance once and then use the new epoch for reads. Another neat trick. Trade off latency for efficiency/throughput.
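
A hedged pseudocode sketch of the two read-only options described above; epoch_service, storage, and their methods are illustrative, not Chardonnay's actual interfaces:

def snapshot_read(keys, epoch_service, storage):
    # Snapshot isolation: read as of the start of the current epoch ec.
    ec = epoch_service.current_epoch()
    # Observe all committed transactions from earlier epochs by waiting for
    # every read-write transaction with epoch < ec to release its write locks.
    storage.wait_for_write_locks_released_before(ec)
    return {k: storage.latest_version_before(k, ec) for k in keys}

def linearizable_read(keys, epoch_service, storage):
    # Strictly serializable variant: wait for the epoch to advance once,
    # then read at the new epoch (trades latency for stronger guarantees).
    start = epoch_service.current_epoch()
    epoch_service.wait_until_epoch_greater_than(start)
    return snapshot_read(keys, epoch_service, storage)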

The system has been extended to multi-datacenter deployments through Chablis (CIDR '24), which introduces global epochs for cross-datacenter consistency while maintaining local epoch efficiency.

Picking up volleyball in NYC with Goodrec and New York Urban

I was so intimidated to go at first, but it is in fact easy and fun to start playing beginner volleyball in New York. The people are so friendly and welcoming that it has been easy to keep playing consistently every week since I started for the first time this August. It's been a great workout and a great way to make friends!

The two platforms I've used to find volleyball games are Goodrec and New York Urban. While these platforms may also offer classes and leagues, I mostly use them to play "pickup" games. Pickup games are where you show up and join (or get assigned to) a team to play for an hour or two. Easy to go on your own or with friends.

I'm not an expert! My only hope with this post is that maybe it makes trying out volleyball in New York feel a little less intimidating for you!

Goodrec

With Goodrec you have to use their mobile app. Beginner tier is called "social" on Goodrec. So browse available games until you find one at the level you want to play. You enroll in (buy a place in) sessions individually.

Sessions are between 90-120 minutes long.

They ask you not to arrive more than 10 minutes early at the gym. When you arrive you tell the gym managers (usually at a desk up front somewhere) you're there for Goodrec and which tier (in case the gym has multiple level games going on at the same time). Then you wait until the Goodrec "host" arrives and they will organize everyone into teams.

Goodrec hosts are players who volunteer to organize the games. They'll explain the rules of the game (makes Goodrec very good for beginners) and otherwise help you out.

Always say thank you to your host!

New York Urban

With New York Urban, pickup sessions are called "open play".

There is no mobile app, you just use the website to purchase a spot in a session. The sessions are longer and cheaper than Goodrec. But there is no host; players self-organize.

The options are more limited too. You play at one of four high schools on either a Friday night or on Sunday. And session slots tend to sell out much more quickly than with Goodrec.

Big City Volleyball

You can also check out Big City Volleyball but I haven't used it yet.

Volo

I haven't ever done Volo but I think I've heard it described as a "beer league", and that even some of the beginner tier sessions with Goodrec and New York Urban are more competitive.

But also, Volo is built around leagues so you have to get the timing right. Goodrec's and New York Urban's pickup games make it easy to get started playing any time of year.

Making friends

It was super awkward to go at first! I went by myself. I didn't know what I was doing. I couldn't remember, and didn't know, many rules. I didn't have court shoes or knee pads.

But the Goodrec host system is particularly great for bringing beginners in and making them feel welcome. You have a great time even if you're terrible.

The first game I went to, I tried to hang out afterward to meet people. But people either came with their SO or with their friends or by themselves so they all just left immediately or hung out in their group.

So you can't just go once and expect to make friends immediately. But if you keep going at the same place and time regularly week over week, you'll see familiar faces. Maybe half the people I play with each week are regulars. If you're friendly you'll start making friends with these people and eventually start going out to bars with them after the games.

Improving

Even if you find yourself embarrassingly bad at first, just keep going! I'm 29, 6'1, 190lbs, and from observation over the past 5 months, age, height, and weight have a very indirect relation to playing ability.

Most of the people who play are self-taught, especially at the lower tiers I've played at. But some people played for the school team in high school or college. These people are fun to play with and you can learn a lot from them.

Most people who are self-taught seem to watch YouTube videos like Coach Donny, helpful for learning how to serve, set, block, etc. Or they take "clinics" (classes) with Goodrec or other platforms. (I have no idea about these, I've never done them before.)

At first I played 2 hours a week and I was completely exhausted after the session. Over time it got easier so I started playing 2-3 sessions a week (6-9-ish hours). With practice and consistency (after about 3-4 months), I started playing Intermediate tier with Goodrec and New York Urban. And I don't think I'll play Beginner/Social at all anymore.

I still primarily play for fun and for the workout and to meet people. But it's also fun to get better!

I played with one person much better than myself in an Intermediate session one time and he mentioned he will probably stop playing Intermediate and only play High Intermediate. He mentioned you get better when you keep pushing yourself to play with better and better players. Good advice!

December 25, 2024

Seconds Since the Epoch

This is not at all news, but it comes up often enough that I think there should be a concise explanation of the problem. People, myself included, like to say that POSIX time, also known as Unix time, is the number of seconds since the Unix epoch, which was 1970-01-01 at 00:00:00.

This is not true. Or rather, it isn’t true in the sense most people think. For example, it is presently 2024-12-25 at 18:51:26 UTC. The POSIX time is 1735152686. It has been 1735152713 seconds since the POSIX epoch. The POSIX time number is twenty-seven seconds lower.

This is because POSIX time is derived in IEEE 1003.1 from Coordinated Universal Time. The standard assumes that every day is exactly 86,400 seconds long. Specifically:

The time() function returns the value of time in seconds since the Epoch.

Which is defined as:

seconds since the Epoch. A value to be interpreted as the number of seconds between a specified time and the Epoch. A Coordinated Universal Time name (specified in terms of seconds (tm_sec), minutes (tm_min), hours (tm_hour), days since January 1 of the year (tm_yday), and calendar year minus 1900 (tm_year)) is related to a time represented as seconds since the Epoch according to the expression below.

If year < 1970 or the value is negative, the relationship is undefined. If year ≥ 1970 and the value is non-negative, the value is related to a Coordinated Universal Time name according to the expression:

tm_sec + tm_min * 60 + tm_hour * 3600 + tm_yday * 86400 + (tm_year-70) * 31536000 + ((tm_year - 69) / 4) * 86400

The length of the day is not 86,400 seconds, and in fact changes over time. To keep UTC days from drifting too far from solar days, astronomers periodically declare a leap second in UTC. Consequently, every few years POSIX time jumps backwards, wreaking utter havoc. Someday it might jump forward.
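
To make the arithmetic concrete, here is a small Python transcription of the expression quoted above; it reproduces the numbers from the example at the top of this post.

def posix_time(tm_sec, tm_min, tm_hour, tm_yday, tm_year):
    # Direct transcription of the POSIX expression quoted above.
    # tm_yday is days since January 1 (0-based), tm_year is calendar year - 1900.
    return (tm_sec + tm_min * 60 + tm_hour * 3600 + tm_yday * 86400
            + (tm_year - 70) * 31536000 + ((tm_year - 69) // 4) * 86400)

# 2024-12-25 18:51:26 UTC: day 359 (0-based) of a leap year, year 2024 -> tm_year 124
print(posix_time(26, 51, 18, 359, 124))       # 1735152686
# Actual elapsed seconds since the epoch include the 27 leap seconds as well:
print(posix_time(26, 51, 18, 359, 124) + 27)  # 1735152713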

Archaeology

Appendix B of IEEE 1003 has a fascinating discussion of leap seconds:

The concept of leap seconds is added for precision; at the time this standard was published, 14 leap seconds had been added since January 1, 1970. These 14 seconds are ignored to provide an easy and compatible method of computing time differences.

I, too, love to ignore things to make my life easy. The standard authors knew “seconds since the epoch” were not, in fact, seconds since the epoch. And they admit as much:

Most systems’ notion of “time” is that of a continuously-increasing value, so this value should increase even during leap seconds. However, not only do most systems not keep track of leap seconds, but most systems are probably not synchronized to any standard time reference. Therefore, it is inappropriate to require that a time represented as seconds since the Epoch precisely represent the number of seconds between the referenced time and the Epoch.

It is sufficient to require that applications be allowed to treat this time as if it represented the number of seconds between the referenced time and the Epoch. It is the responsibility of the vendor of the system, and the administrator of the system, to ensure that this value represents the number of seconds between the referenced time and the Epoch as closely as necessary for the application being run on that system….

I imagine there was some debate over this point. The appendix punts, saying that vendors and administrators must make time align “as closely as necessary”, and that “this value should increase even during leap seconds”. The latter is achievable, but the former is arguably impossible: the standard requires POSIX clocks be twenty-seven seconds off.

Consistent interpretation of seconds since the Epoch can be critical to certain types of distributed applications that rely on such timestamps to synchronize events. The accrual of leap seconds in a time standard is not predictable. The number of leap seconds since the Epoch will likely increase. The standard is more concerned about the synchronization of time between applications of astronomically short duration and the Working Group expects these concerns to become more critical in the future.

In a sense, the opposite happened. Time synchronization is always off, so systems generally function (however incorrectly) when times drift a bit. But leap seconds are rare, and the linearity evoked by the phrase “seconds since the epoch” is so deeply baked in to our intuition, that software can accrue serious, unnoticed bugs. Until a few years later, one of those tiny little leap seconds takes down a big chunk of the internet.

What To Do Instead

If you just need to compute the duration between two events on one computer, use CLOCK_MONOTONIC, or better yet, CLOCK_BOOTTIME. If you don’t need to exchange timestamps with other systems that assume POSIX time, use TAI, GPS, or maybe LORAN. If you do need rough alignment with other POSIX-timestamp systems, smear leap seconds over a longer window of time. Libraries like qntm’s t-a-i can convert back and forth between POSIX and TAI.

There’s an ongoing effort to end leap seconds, hopefully by 2035. It’ll require additional work to build conversion tables into everything that relies on the “86,400 seconds per day” assumption, but it should also make it much simpler to ask questions like “how many seconds between these two times”. At least for times after 2035!

December 24, 2024

Helping Christmas Elves Count Presents (or: Vectorized Overflow Checking)

Vectorized Overflow Checking

In our earlier post on overflow handling we explained how to use the CPU flags to detect integer overflows and wrote: “this doesn’t have zero overhead as the compiler can’t vectorize the function using checked arithmetic.” Not satisfied, we took matters into our own hands: if the compiler can’t help us, we have to help ourselves!

This post explains in very low-level detail how you can very quickly sum up (signed) integers on modern x86 CPUs using specialized vector instructions such as vpternlogd. We also show some numbers comparing the hand-written vectorized sum with the compiler-assisted checked arithmetic that demonstrate that you can gain a lot of performance even when checking for overflows.

December 23, 2024

Use of Time in Distributed Databases (part 1)

Distributed systems are characterized by nodes executing concurrently with no shared state and no common clock. Coordination between nodes is needed to satisfy some correctness properties, but since coordination requires message communication there is a performance tradeoff that prevents nodes from communicating/coordinating too frequently.


Timestamping and ordering

Why do the nodes, in particular database nodes, need coordination anyway? The goal of coordination among nodes is to perform event ordering and align the independent timelines of nodes with respect to each other. This is why timestamping events and ordering them is very important. Nodes run concurrently without knowing what the other nodes are doing at the moment. They learn about each other's states/events only by sending and receiving messages, and this information by definition comes from the past state of the nodes. Each node needs to compose a coherent view of the system from these messages, all the while the system is moving along, advancing and changing state. This is like trying to compose a panoramic snapshot of a parade moving down the street only by using small 4x6 photos shot by the participants at arbitrary times.

In 1978, Leslie Lamport captured the ordering relationship between events in a distributed system by introducing a neat abstraction called logical clocks. Logical clocks timestamp events based on causality (happened-before) which comes from either precedence within the same thread of execution at the node or via message communication received from another node.

Unfortunately logical clocks are disconnected from real time (wall clocks). Logical clocks would not be able to establish an ordering between event1 at node1 and event2 at node2 occurring a full 24 hours later, if there had not been a chain of communication linking event1 to event2 with a causal precedence relationship in that period. Logical clocks ignore all back-channels, and require strict enforcement of logical-clock timestamping on all communication. Moreover, they don't allow you to query the physical time at which an event took place.

Their cousin vector clocks can get you consistent snapshots across nodes (well, with O(n) vector clock overhead), but vector clocks are also vulnerable to the same drawbacks above. To address these drawbacks, we need physical clocks and real-time affinity. To drive this point home, and learn about even more benefits synchronized clocks can give us, here is another round of motivation for the importance of time in distributed systems and databases.


What are synchronized clocks good for?

Asymmetry of information is the curse of distributed systems. The coordinated attack problem provides a great demonstration of this. The two generals in that problem are stuck in a Zeno's paradox of "you may not know that I know that you know that I know" information chaining. A classic paper to check out on this topic is "Knowledge and common knowledge in a distributed environment".

Synchronized clocks create a consistent global time reference, seeding the buildup of common knowledge. Having a globally consistent time reference (i.e., synchronized clocks/time) is a big boon for common knowledge. Without a common reference time, even with a same-rate local clock at each node, you lose considerable information about the other party.

Consider this example where node2 is trying to learn how much longer node1's lease will remain valid. Without synchronized time, unilateral communication from node1 to node2 does not suffice to convey this information, because the message from node1 to node2 could have been arbitrarily delayed in the channel. This makes the timer on node2 useless for coordinating between the two nodes, since by the time the message arrives the lease might have long expired. With unilateral communication node2 would only know for sure when node1's lease is invalidated, but cannot know whether node1 is still holding a lease. It can only reason about the absence of a lease held by node1, but not about its presence.

Without synchronization, node2 requires two-way communication to estimate lease validity. In this scheme, node2 initiates the communication and starts its timer when it sends the message. When it receives a response back from node1, which contains the remaining time on the lease for node1, it needs to subtract the elapsed round-trip time (RTT) to get a lower bound on the validity of node1's lease. It cannot subtract just RTT/2 from the lease timer, because communication delay may be asymmetric.

But, if we have synchronized clocks, this information can be conveyed with just a unilateral communication from node1 to node2. Node1's message would say "I have the lease till T_end", and when node2 receives the message, this T_end information is instantly usable by node2 without suffering the RTT rounding above, thanks to the synchronized clocks. Synchronized clocks simplify coordination and reduce uncertainty in distributed systems.
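
A minimal sketch of the two schemes described above (illustrative only, not a real lease protocol; send_lease_query stands in for the round trip to node1):

import time

def lease_left_without_sync(send_lease_query):
    # node2 must initiate, measure the round trip, and subtract the *full* RTT
    # (not RTT/2) because the delay may be asymmetric.
    t0 = time.monotonic()
    remaining = send_lease_query()      # node1 replies with its remaining lease time (seconds)
    rtt = time.monotonic() - t0
    return remaining - rtt              # conservative lower bound

def lease_left_with_sync(t_end, now):
    # With synchronized clocks node1 just announces "I hold the lease until T_end";
    # node2 compares against its own clock, no RTT accounting needed.
    return t_end - now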


Loosely synchronized clocks

Ok, let's discuss physical clocks, and clock synchronization at last! Each node has a local clock, which is just a crystal/quartz oscillator. A circuit counts the number of times a crystal oscillates and declares that a millisecond has passed because it counted, say, 1000 oscillations. This is not precise of course. Atomic clocks use rubidium, which has an oscillation/clock error of one microsecond a day. But quartz is not like that. It is dirt cheap and small, but also inaccurate, so it drifts too much. No two quartz crystals are identical; they have big variations. Another thing about quartz crystals is that they are very temperature sensitive: when the temperature gets colder they oscillate more, and when it gets hotter they oscillate less. Therefore frequent time synchronization is needed when using crystal oscillator based clocks. Otherwise each node's clock would drift apart from the others.

In 1985, the Network Time Protocol (NTP) entered the game. NTP is a networking protocol for clock synchronization between computer systems over packet-switched, variable-latency data networks. NTP is by far the most popular time sync distribution protocol on the Internet. However, NTP clock sync errors can amount to tens of milliseconds. Where does the biggest nondeterministic error in NTP synchronization come from? It comes from the network. The switches and routers are the source of nondeterministic errors for NTP. Asymmetry in the links is a big error source. Consider a 100 Mbps link feeding into a 1 Gbps link: one way there is no delay, but coming back the other way there is queuing delay.

Unfortunately, using loosely synchronized clocks via NTP, with several tens of milliseconds of uncertainty, violates causality and monotonicity under errors. This is problematic for ordering events and for our consistent snapshots across nodes. Hybrid Logical Clocks (HLC) provide a remedy by integrating Lamport's logical time with loosely synchronized physical clocks. This makes them resilient to synchronization errors and enables progress in degraded conditions.
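
For reference, a compact sketch of the HLC update rules from the HLC paper, assuming pt is a function that returns the node's physical clock:

class HLC:
    """Hybrid Logical Clock: (l, c) where l tracks physical time and c breaks ties."""

    def __init__(self, pt):
        self.pt = pt          # function returning the local physical clock
        self.l = 0            # logical component (max physical time seen)
        self.c = 0            # counter component

    def send_or_local_event(self):
        l_old = self.l
        self.l = max(l_old, self.pt())
        self.c = self.c + 1 if self.l == l_old else 0
        return (self.l, self.c)

    def receive_event(self, l_msg, c_msg):
        l_old = self.l
        self.l = max(l_old, l_msg, self.pt())
        if self.l == l_old and self.l == l_msg:
            self.c = max(self.c, c_msg) + 1
        elif self.l == l_old:
            self.c += 1
        elif self.l == l_msg:
            self.c = c_msg + 1
        else:
            self.c = 0
        return (self.l, self.c)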

Of course using tightly synchronized clocks is even better, and gets you better performance with more certainty about other nodes' states/events/orderings. Luckily we have had great progress in this area in the last decade.


The advent of tightly synchronized clocks

In 2018, I wrote this short survey on the advent of tightly synchronized clocks.

One big development has been the availability of cheaper (but almost as precise) clocks than atomic clocks. Atomic clocks use rubidium and drift about 1 µs/day. This is so small that relativistic effect considerations kick in here. OCXO ovenized oscillators provide the next best way to have precise clocks in a box in a much cheaper manner. They achieve a drift of ~25 µs/day. Pair these bad boys with a GPS antenna and you get to within less than 100 nanoseconds of true time.

The rest is ensuring high quality distribution of time information and avoiding the sources of nondeterminism creeping in. Precision Time Protocol (PTP) with hardware timestamping helps a lot here and gets you to ~50 µs clock uncertainty, as AWS Timesync provides for public access.

I would be remiss if I didn't talk about the clockbound API here. Google TrueTime (TT) introduced these bounded uncertainty intervals. A call to the TT.now() API returns earliest and latest bounds that the true clock could in the worst case fall into. This is awesome because if you ensure non-overlapping intervals on events, you achieve definite event ordering. This allows you to wait out the latest bound on time to strictly order/serialize your commit event. This also provides another level of safety for the correctness of timestamp-based ordering. For ordering to go awry, both clock synchronization and the clock-bound around it need to go bad concurrently. If your clock synchronization goes bad, the bounds around it would grow, and you would know something is going wrong.


Timestamp-based Algorithms for Concurrency Control in Distributed Database Systems (VLDB'80)

Ok, this introduction already got long. Let's finish this Part 1 with a banger! This paper is from VLDB 1980, and predates even NTP. It is written with a typewriter, for God's sake. But Bernstein and Goodman's vision was ahead of its time by many decades. They assume synchronized clocks and propose Timestamp Ordering (TSO) algorithms for concurrency control in distributed database systems. The banger is that after more than four decades, Amazon DynamoDB (ATC'23) uses a TSO-based algorithm for its one-shot transactions.

Here is how the TSO-based optimistic concurrency control algorithm works. Transactions are serialized by their start time ts_ti. Transactions don't hold any locks, and resort to optimistic validation. Instead, every object x is tagged with the timestamp of the last transaction(s) that successfully read/wrote x. So, object x has two pieces of metadata associated with it: read_ts(x) and write_ts(x). To guarantee the serializability of transactions, it is crucial that these timestamps are monotonically increasing (hence the need for synchronized clocks!). The update protocol checks and ensures this for each access. If a transaction tries to access (read/write) an object from its future it is aborted. For example, if ti finds that it is reading x which has write_ts(x) > ts_ti, then this would be a future read, and ti aborts itself. It may later be retried with a new/higher ts, and hopefully it will succeed then. That is it. That is the whole protocol.
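
Here is a minimal single-node sketch of those TSO checks (illustrative Python, not the paper's distributed protocol; the class and method names are made up):

class Aborted(Exception):
    pass

class TSOStore:
    """Each object carries read_ts and write_ts; accesses from the 'past' abort."""

    def __init__(self):
        self.data = {}      # key -> (value, read_ts, write_ts)

    def read(self, key, ts):
        value, read_ts, write_ts = self.data.get(key, (None, 0, 0))
        if ts < write_ts:                    # reading an object from our future
            raise Aborted(f"read of {key}: ts {ts} < write_ts {write_ts}")
        self.data[key] = (value, max(read_ts, ts), write_ts)
        return value

    def write(self, key, value, ts):
        _, read_ts, write_ts = self.data.get(key, (None, 0, 0))
        if ts < read_ts or ts < write_ts:    # a later transaction already saw or wrote it
            raise Aborted(f"write of {key}: ts {ts} too old")
        self.data[key] = (value, read_ts, ts)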

After 1980, the next splash on the use of synchronized clocks arrived at PODC 1991, with Barbara Liskov's "Practical Uses of Synchronized Clocks in Distributed Systems" paper. Interestingly, this paper does not cite Bernstein and Goodman's groundbreaking paper. It proposed several distributed algorithms that use synchronized clocks:

  • Leases for cache consistency and leader-local reads. Technically, these are realized only via same-rate local timers and don't need synchronized clocks. 
  • SCMP protocol for at-most-once message delivery
  • Kerberos authentication tickets

A noteworthy principle the paper advocated was to use clocks for performance, and not for correctness. Notice that in the Bernstein-Goodman paper a decade earlier, the correctness of serialization of OCC transactions did not rely on time synchronization, but performance would improve with better time synchronization.

Then came two decades of silence. The next big update on the practical use of synchronized clocks came 21 years later, in 2012, with Google Spanner.


Outline of Use of Time in Distributed Databases

So, this is what I am thinking of covering in the next posts in this series. I already have summaries of these works in my blog from the last several years. It will be nice to organize this information and draw some trends/conclusions from it. One thing is clear: we see an accelerating trend of adoption of synchronized clocks in distributed databases.

Part 2: Use of logical clocks in database design: Discusses use of vector clocks (e.g., Dynamo (SOSP'07), ORBE (SOCC'13), NAM-DB (VLDB'17)), dependency graph maintenance (COPS (SOSP'11), Kronos (EuroSys'14)), and epoch server (e.g., Chardonnay (OSDI'23), Chablis (CIDR'24)).

Part 3: Use of synchronized physical clocks in database design: Granola (ATC'12), Clock-SI (SRDS'13), GentleRain (SOCC'14), Occult (NSDI'17), Nezha (VLDB'23)

Part 4: Use of clocks in production databases: Spanner (OSDI'12), MongoDB (SIGMOD'19), CockroachDB (SIGMOD'20), DynamoDB (ATC'23), Cassandra/Accord (2023), Aurora Limitless (2023), Aurora DSQL (2024) 

Part 5: Lessons learned: Takeaways, correctness of synchronized clock-based algorithms, tradeoffs, and trends

December 22, 2024

How bloom filters made SQLite 10x faster

This is the fascinating story of how researchers used Bloom filters cleverly to make SQLite 10x faster for analytical queries. These are my five-minute notes on the paper SQLite: Past, Present, and Future