You may have read about Hybrid search for Azure HorizonDB. It is presented as combining BM25 full‑text and vector similarity in a single query. But how are they actually combined? The execution plan answers that.
In this post, I use a small synthetic product catalog to ensure the entire demo is reproducible. The text is sufficiently realistic for BM25 queries, and the embeddings are deterministic synthetic vectors, allowing you to run the full script without needing an embedding model. If you have azure_openai.create_embeddings() configured, you can substitute the synthetic embedding function with actual embeddings.
I will test two separate queries, followed by three methods of combining BM25 with vector retrieval:
cascade (BM25 → vector)
cascade (vector → BM25)
hybrid (parallel + fusion)
These are not interchangeable; they represent distinct trade-offs between recall and performance.
Setup
This example uses:
pg_textsearch for BM25 full-text search
pgvector for the vector type and distance operator
pg_diskann for the vector index, when available
The extensions must be listed in azure.extensions to enable CREATE EXTENSION.
In addition, pg_textsearch must be loaded on startup:
I should use azure_openai.create_embeddings(), as I mentioned in the previous blog post. However, for this demo, I opted to create a fake embedding function that maps text to a simple vector(16) based on product keywords. This isn't an embedding model, but it ensures that execution plans are reproducible without relying on a model. It also simplifies the concept of embeddings with a basic, small-dimension LIKE-style semantic vector. The purpose is to demonstrate a search query using both a real text search and a vector-based semantic search.
Using a straightforward CASE that searches for specific words, I create the semantic vector for a text. This method employs vector similarity solely for demonstration, without utilizing a model. You can envision AI models doing the same but with thousands of dimensions driven by large language models (LLMs) rather than keywords.
Table with text and embeddings
I created a product catalog table with structured fields, text, and a single vector column:
I loaded a few hundred thousand rows. The data is synthetic but intentionally patterned: products have categories, materials, styles, and terms that are useful for both BM25 and vector search:
insertintoproducts(product_id,category,brand,price,title,description,embedding)withgeneratedas(selectgasproduct_id,(array['chair','table','sofa','lamp','desk','shelf'])[1+(g%6)]ascategory,(array['Contoso','Fabrikam','Northwind','AdventureWorks','Wingtip','Tailspin'])[1+(g%6)]asbrand,(array['mid-century modern','industrial','scandinavian','classic','minimalist','outdoor'])[1+(g%6)]asstyle,(array['walnut wood','black metal','oak wood','leather','fabric','brushed steel'])[1+((g/7)%6)]asmaterial,(25+(g%500))::numeric(10,2)aspricefromgenerate_series(1,200000)asg)selectproduct_id,category,brand,price,initcap(style||' '||material||' '||category)astitle,concat(style,' ',category,' by ',brand,' with ',material,'. Designed for ',casewhencategoryin('chair','sofa')then'living room seating'whencategoryin('table','desk')then'home office and dining'whencategory='lamp'then'warm interior lighting'else'storage and display'end,'. Product code ',product_id,'.')asdescription,demo_embedding(concat_ws(' ',style,material,category,brand))asembeddingfromgenerated;
I tested some searches on search_text and embedding and started indexing those columns.
Create the indexes (BM25 and DiskANN)
First, I created the BM25 full-text index. HorizonDB’s BM25 full-text search brings BM25 ranking into PostgreSQL without a separate Elasticsearch/OpenSearch Search service. It uses the open-source extension pg_textsearch:
I started with a keyword search, using to_bm25query() to define the BM25 query. Ranking uses BM25 and is performed with the <@> operator. Top-k queries use this operator in ORDER BY ... LIMIT.
BM25 is implemented as an index-backed operator that must be bound to a specific index. This is why prepared statements require explicitly naming the index:
postgres=>preparequery1(text,int)asselectp.product_id,p.title,p.category,p.brand,p.pricefromproductsporderbyp.search_text<@>to_bm25query($1,'products_bm25_idx')limit$2;postgres=>explain(analyze,buffers,verbose,costsoff)executequery1('mid century modern wooden chair',10);QUERYPLAN---------------------------------------------------------------------------------------------------------------------------------------------Limit(actualtime=0.268..0.442rows=10loops=1)Output:product_id,title,category,brand,price,((search_text<@>'products_bm25_idx:mid century modern wooden chair'::bm25query))Buffers:sharedhit=582->IndexScanusingproducts_bm25_idxonhybrid_demo.productsp(actualtime=0.267..0.440rows=10loops=1)Output:product_id,title,category,brand,price,(search_text<@>'products_bm25_idx:mid century modern wooden chair'::bm25query)OrderBy:(p.search_text<@>'products_bm25_idx:mid century modern wooden chair'::bm25query)Buffers:sharedhit=582QueryIdentifier:-4837396746836655100Planning:Buffers:sharedhit=1PlanningTime:0.116msExecutionTime:0.460ms(12rows)
The Index Scan returns the Top-10 result ('rows=10') directly in ranking order (Order By).
Lexical retrieval is good for exact words, rare terms, product codes, and anything where the user expects the same token to appear in the document.
Query 2: ANN only
I further explored the semantic aspect by performing similarity search using the cosine distance operator (<=>) for vectors:
Upstream MySQL published an out-of-schedule release this week with two high-severity CVE fixes. We’ve pulled those fixes into our next builds and are skipping the two versions we had already queued: Percona Server for MySQL 8.4.9 and 9.7.0. These fixes arrived through Oracle’s new monthly Critical Security Patch Updates (CSPUs), which Oracle announced begin May … Continued
The term “HTAP” is the holy grail of database systems. It describes what every data engineer would
love: Being able to do all your data Processing, no matter if it’s complex Analytics or
fast-paced Transactional operations in a single Hybrid system.
Many system have tried to enable HTAP but have found that building a truly hybrid system is an
impossible challenge. The keynote talk at Databricks Data + AI Summit 2026 highlighted that it is
impossible to get a query on the analytical warehouse to execute in less than 1 second. As a new
solution that solves this, the Databricks cofounder Reynold Xin announced “Databricks LTAP” and
directly cited our paper on “Morsel-Driven Parallelism” as one of the “latest and coolest academic
papers”:
With their LTAP offering, Databricks is actually able to offer sub-second transactional performance
while maintaining its well-known analytical performance. That’s genuinely impressive! But if you
look under the hood (e.g. by watching Databricks engineers talk about the technology behind LTAP),
you will see that LTAP still uses the same classical separation between operational and analytical
data. By their own description, LTAP does not run on one engine. It keeps a transactional engine and
an analytical engine and unifies them at the storage layer. You can think of this being a really
good, really fast zero-ETL system.
Zero-ETL is not enough, though. As long as you run two engines, you have two sources of truth, and
there is a moment where data crosses from one to the other. Zero-ETL makes that gap small, but even
if the name suggests otherwise, it cannot make it zero. The data still has to move from the system
that wrote it to the system that reads it. When you really want HTAP, you really care about this gap
being actually zero. Take fraud detection: a warehouse can flag suspicious activity only after the
fact, but you want to catch it before the money moves. For zero read lag and a single source of
truth, you have to re-think the entire system around one engine that runs both workloads natively.
It’s almost impossible to change an existing system designed for either transactions or analytics
into a true hybrid system. Database researchers have known this for several decades already. We
weren’t the first to attempt it, and we didn’t coin the term HTAP. Systems like SAP
HANA and HyPer
went after it before us but required your data to fit completely in main memory. Sadly, this didn’t
work out. What made us reconsider that HTAP was back on
the table was that fast SSDs became widely available. So ten years ago, we started
Umbra, a research project with one goal: building a truly HTAP system.
CedarDB is built on that foundation.
Since then, new developments in the database space focused only on analytics, leading to great
analytical systems such as Databricks Lakehouse and Snowflake, and ClickHouse. It turns out the
existing transactional systems, even regular PostgreSQL, scaled to even the most demanding AI
workloads. What’s hard is making sure transactions
and analytics don’t slow each other down when running at the same time.
To make this work well, you need to unify both the execution engine and the storage format without
introducing new bottlenecks. For that, we built a hybrid column-row
format as our data layer. It can support fast writes on hot
data, automatically transforming between hot write-optimized and cold compressed data as needed,
fully transparently as a single copy.
Not only that, we also built the foundations for fast analytical processing on modern hardware. You
can find an overview of key techniques on our technology
page, including morsel-driven parallelism, data-centric code
generation, a cost-based optimizer with full subquery decorrelation, and a buffer manager designed
to fully utilize fast SSDs.
Databricks LTAP is coming soon, CedarDB is in production today! Good to see the industry catching up
to the problem. Come see what the answer looks like when it’s already running.
The azure_ai extension on HorizonDB adds generative AI functions to PostgreSQL, allowing users to utilize Azure AI's generation, ranking, and embedding models. Here's a four-step example that demonstrates how to define default models, set up endpoints, register them, and use the in SQL queries.
1. Allow and install the extension
The azure_ai extension must be set in azure.extensions from the parameter group:
An AI Model Management feature is coming to HorizonDB, currently in private preview, which is basically a zero-setup mode for azure_ai, but for the moment, I will do it manually.
PostgreSQL can now invoke the Azure OpenAI deployment.
With this solution, the database itself doesn’t host the model. It only contains the information needed to call it, such as the endpoint, deployment, and key, but the calls to Azure OpenAI are transparent to the users.
4. Use the model from SQL queries
Now, azure_ai.generate() can use the registered model for generative AI:
postgres=>SELECTazure_ai.generate('Who is Slonik and how does he look like? context:'||version(),'default-chat');generate-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------SlonikisamascotforPostgreSQL,oftendepictedasafriendly,cartoonishelephant.Hetypicallyhasablue-graybody,largefloppyears,andacheerfulexpression.SlonikembodiesthePostgreSQLcommunity's spirit and is often used in promotional materials and events related to PostgreSQL.
(1 row)
postgres=>
The model name default-chat is the default for this function. I can omit it.
Instead of employing the model for text generation, I can utilize it to validate my text with azure_ai.is_true().
Let's verify the correct names for our favorite database:
postgres=>selectazure_ai.is_true(format('%s is the right name for Slonik''s database',unnest)),string_agg(unnest,',')fromunnest(ARRAY['PostgreSQL','Postgres','PG','pgsql','postgresql','POSTGRES','pgdb','postgres-db','SQL','psql','postmaster','postgré','postgrès','posgress','posgresql','postgrasql','postgray','postgrest','postgrezql','postgrex','postgresesql','postgresequel','Post-Ingres',' Post-Gres-Q-L','Postgres95','pg-sql','pgserver','pg-database','HorizonDB','slonik-db','elephant-db','the elephant'])groupby1orderby1desc;agg---------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------t|PostgreSQL,Postgres,postgresql,POSTGRES,theelephantf|PG,pgsql,pgdb,postgres-db,SQL,psql,postmaster,postgré,postgrès,posgress,posgresql,postgrasql,postgray,postgrest,postgrezql,postgrex,postgresesql,postgresequel,Post-Ingres,Post-Gres-Q-L,Postgres95,pg-sql,pgserver,pg-database,HorizonDB,slonik-db,elephant-db(2rows)
Ok, according to gpt-4o-mini, "The Elephant" is a valid name for PostgreSQL.
The chat model can also be used to extract structured information from unstructured text. For example, from the version() banner I'm interested in the PostgreSQL compatibility version, and the name of the managed service:
postgres=>SELECTversion();version-----------------------------------------------------------------------------------------------------------------------PostgreSQL17.9(AzureHorizonDB(70f3b593ec7)(release))onx86_64-pc-linux-gnu,compiledbygcc(GCC)13.2.0,64-bit(1row)postgres=>SELECTazure_ai.extract(version(),-- text from which to extractARRAY['PostgreSQL compatibility','Cloud service name']);extract-------------------------------------------------------------------------------{"Cloud service name":"Azure HorizonDB","PostgreSQL compatibility":"17.9"}(1row)
In order to generate embeddings, I need to deploy a dedicated model:
With the same method I used for chat models, I got the endpoint, name, and key to access this embedding model:
Note that the azure_ai extension deploys functions in different namespaces. High‑level semantic operations, independent of the model vendor, such as generate(), is_true(), extract(), rank() are in the azure_ai schema. In contrast, create_embeddings, which returns vectors and is tied to the OpenAI API, is in the azure_openai schema.
I have also enabled pg_vector to use the vector data type and operators for generated embeddings. Here is an example using a prompt to find PostgreSQL settings related to shared buffer cache memory:
postgres=>CREATEEXTENSIONvector;CREATEEXTENSIONpostgres=>WITHsettingsAS(SELECTname,short_desc,azure_openai.create_embeddings('default-embedding',row_to_json(pg_settings)::text)::vectorASembeddingFROMpg_settings)SELECTname,short_descFROMsettingsORDERBYembedding<=>azure_openai.create_embeddings('default-embedding','Buffer cache shared memory allocated by PostgreSQL')::vectorLIMIT5;name|short_desc----------------------------+----------------------------------------------------------------------------------------effective_cache_size|Setstheplanner's assumption about the total size of the data caches.
shared_memory_type | Selects the shared memory implementation used for the main shared memory region.
shared_buffers | Sets the number of shared memory buffers used by the server.
shared_memory_size | Shows the size of the server'smainsharedmemoryarea(roundeduptothenearestMB).dynamic_shared_memory_type|Selectsthedynamicsharedmemoryimplementationused.(5rows)
I haven't set an index or stored embeddings here. This example simply demonstrates how it works in a stateless demo. The similarity search uses the cosine distance operator <=> to compare data from pg_settings with my prompt, returning the Top-5 matches. In a real application, you would generate the embeddings within an AI pipeline, then store and index them using pg_vector or DiskANN, rather than calling azure_openai.create_embeddings() on each query, except for the prompt.
AzureAI also provides functions for re-ranking the results of a similarity search. Ideally, a dedicated ranking model should be used. Here, I didn't deploy one and used the chat model for demonstration.:
postgres=>WITHsettingsAS(SELECTname,short_desc,row_to_json(pg_settings)::textASdoc,azure_openai.create_embeddings('default-embedding',row_to_json(pg_settings)::text)::vectorASembeddingFROMpg_settings),candidatesas(SELECTname,short_desc,docFROMsettingsORDERBYembedding<=>azure_openai.create_embeddings('default-embedding','Buffer cache shared memory allocated by PostgreSQL')::vectorLIMIT20),rerankedAS(SELECT*FROMazure_ai.rank('Setting the buffer cache shared memory allocated by PostgreSQL',ARRAY(SELECTdocFROMcandidates),-- text to rankARRAY(SELECTnameFROMcandidates),-- id of the item'default-chat'))SELECTr.*,s.short_descFROMsettingssJOINrerankedrONr.id=s.nameORDERBYr.rankLIMIT5;id|rank|score|short_desc----------------------------+------+-------+----------------------------------------------------------------------------------------shared_buffers|1|0.9|Setsthenumberof ... (truncated)
In a previous post, I explored how MongoDB chooses between nested loop, indexed loop, and hash join strategies for $lookup. Here, I examine what occurs when $lookup runs on DocumentDB for PostgreSQL—an open-source extension implementing the MongoDB API on PostgreSQL.
The document model minimizes the need for joins by embedding related data directly within documents. However, when a join is necessary — such as for reference data that updates independently, many-to-many relationships, or dimensional lookups — the flexibility of embedding can complicate join optimization.
The goal isn't just to identify "which database is faster"—it's to understand why their behaviors differ, the trade-offs involved, and the options when join performance matters.
Relational databases tend to perform more joins because normalized schemas require them, but they also optimize joins more effectively thanks to scalar, well-typed columns. In contrast, document databases perform fewer joins thanks to embedding, but when they do, flexible field semantics—such as arrays—restrict the available join algorithms.
I've run all tests in Docker containers with default settings on the same machine. The timings are indicative, not benchmarks — they illustrate the relative cost of different approaches, not absolute performance under production conditions (caching, concurrency, hardware, and tuning would all change the numbers).
An example: fact and reference table
In a document database, you'd typically embed related data to avoid joins. But some data doesn't embed well:
Exchange rates change continuously. If you embed rate_to_usd inside each portfolio document, you'd need to update millions of documents every time a rate moves.
Portfolios reference a currency, and you want to compute USD valuations by looking up the current rate at query time.
This is a classic case where a $lookup join makes sense: a large fact collection (portfolios) joined to a small, frequently-updated reference collection (fxRates). The document model can't avoid this join without accepting stale embedded rates.
Schema and data generation
I created two collections:
portfolios: 5 million documents with a currency field (5 distinct values)
fxRates: 5 documents mapping each currency to its USD exchange rate
I used mongosh to create and load the collection with the following commands:
MongoDB's $lookup combined with $unwind behaves like a LEFT OUTER JOIN followed by filtering out non-matching rows.
Why flexible documents make joins hard
In a relational database, portfolios.currency is a VARCHAR column. The optimizer knows it's a single scalar value per row. It can extract it, hash it, sort it, or probe an index with it — all with well-defined operators.
In a document database, currency might be:
A string: "USD"
An array: ["USD", "EUR"]
Missing entirely
A nested document
MongoDB's $lookup compatibility requires the following behavior:
If localField is an array ["USD", "EUR"], it matches any foreign document where foreignField equals "USD" OR "EUR" (or contains either, if it's also an array).
It's effectively an "any element matches any element" semantic.
This means that the join condition is not always a simple equality a = b, but may involve “any element matches” semantics evaluated at runtime. Instead, the matching logic must evaluate each document's field at runtime, determine whether it's a scalar or an array, and match accordingly.
The safest general approach is a lateral join — executing the inner query for each outer document and passing the current document's field value into the matching function. This is what both MongoDB and DocumentDB for PostgreSQL do.
What happens under the hood (DocumentDB for PostgreSQL)
I use the DocumentDB API in a SQL query rather than the MongoDB-compatible endpoint to view the PostgreSQL execution plan.
Since I joined a large collection with a small one and require all documents from both, I would anticipate a hash join. Instead, it uses a nested loop join:
Nested Loop (actual time=579..64792 rows=5000000 loops=1)
-> Seq Scan on documents_11 collection (rows=5000000 loops=1)
-> Seq Scan on documents_10 collection_0_1 (rows=1 loops=5000000)
Filter: bson_dollar_lookup_join_filter(...)
Rows Removed by Filter: 4
Execution Time: 87750 ms
The fxRates table (5 rows, fitting in a single 8kB block) is scanned 5 million times. PostgreSQL's cost-based optimizer knows the table is tiny and fits in cache, so a sequential scan is the right choice over an index scan — but the scan is still executed 5 million times because of the LATERAL pattern. The filter function bson_dollar_lookup_join_filter is evaluated 25 million times. This function handles array semantics — it extracts the field from the outer document, determines whether it's scalar or an array, and checks for matches in the inner document.
Because the inner side is marked as LATERAL, it depends on the current outer row. This prevents PostgreSQL from evaluating both sides independently, which is required for hash or merge joins. As a result, only a nested loop strategy is possible.
In MongoDB, the equivalent behavior is the IndexedLoopJoin strategy: for each outer document, probe the index on the foreign field. The algorithm and per-document cost are the same.
A note on MongoDB's Hash Join
MongoDB 8.0 can use hash join for $lookup when allowDiskUse: true, no compatible index on the foreign field, the foreign collection is small, and the SBE engine is active. Under these conditions, MongoDB builds an in-memory hash table from the foreign collection, correctly handling array semantics by storing per-element entries.
In tests with 5M portfolios and 5 fxRates, MongoDB's native HashJoin finished in ~14 seconds — the fastest of my tests. Without tweaks, it took 170 seconds — the worst.
To achieve 14 seconds, I dropped the index on the foreign field, enabled allowDiskUse, and set internalQueryFrameworkControl to trySbeEngine. The default trySbeRestricted mode doesn't push the $lookup and $unwind to SBE, since the optimization depends on feature flags that aren't enabled in this mode. With trySbeEngine, SBE handles the pipeline, using HashJoin:
// Setup for hash joindb.adminCommand({setParameter:1,internalQueryFrameworkControl:"trySbeEngine"});db.fxRates.dropIndex("currency_1");// The query (same as all other tests)db.portfolios.aggregate([{$lookup:{from:"fxRates",localField:"currency",foreignField:"currency",as:"fx"}},{$unwind:"$fx"},{$project:{portfolioId:1,valuation:1,currency:1,rate_to_usd:"$fx.rate_to_usd",valuation_usd:{$multiply:["$valuation","$fx.rate_to_usd"]}}}],{allowDiskUse:true}).explain("executionStats");// Restoredb.fxRates.createIndex({currency:1},{unique:true});db.adminCommand({setParameter:1,internalQueryFrameworkControl:"trySbeRestricted"});
DocumentDB for PostgreSQL doesn't currently implement this optimization — it relies on PostgreSQL's native join strategies, which don't understand BSON array semantics. Under normal conditions, both MongoDB and DocumentDB use a Nested Loop join.
Attempting alternatives via the MongoDB API
Using _id as Join Key (~71s)
The documentDB extension has a special case when foreignField is _id — it uses direct object_id equality:
// Reshape fxRates to use currency as _iddb.fxRates.drop();currencies.forEach(cur=>{db.fxRates.insertOne({_id:cur,rate_to_usd:Math.random()*(1.5-0.5)+0.5,last_updated:newDate()});});db.portfolios.aggregate([{$lookup:{from:"fxRates",localField:"currency",foreignField:"_id",as:"fx"}},{$unwind:"$fx"},{$project:{portfolioId:1,valuation:1,currency:1,rate_to_usd:"$fx.rate_to_usd",valuation_usd:{$multiply:["$valuation","$fx.rate_to_usd"]}}}])
It uses an index scan with the join condition applied as an Index Cond, which is more efficient than a sequential scan with a Filter. It's slightly faster, taking 71 seconds instead of 88 seconds, yet it remains a nested loop with 5 million iterations:
Nested Loop (actual time=17..48170 rows=5000000 loops=1)
-> Seq Scan on documents_11 collection (rows=5000000 loops=1)
-> Index Scan using _id_ on documents_12 (rows=1 loops=5000000)
Index Cond: (object_id = ANY (bson_dollar_lookup_extract_filter_array(...)))
Execution Time: 70578 ms
This is the same as MongoDB's IndexedLoopJoin — the _id field is guaranteed to be scalar, so the extension can use a direct equality lookup on the primary key. However, it doesn't change the join strategy.
Uncorrelated $lookup + $filter (~68s)
A minor enhancement involves reading all fxRates at once, using an empty pipeline and no join condition, attaching the data as an array, and then filtering locally:
The execution plan shows a Nested Loop with a single loop:
Nested Loop (actual time=17..20177 rows=5000000 loops=1)
-> Aggregate (rows=1 loops=1) -- reads fxRates ONCE
-> Seq Scan on documents_11 (rows=5000000 loops=1)
Execution Time: 67905 ms (of which ~48s is $addFields+$project)
The join itself is fast — fxRates are aggregated once into a single array. But the per-document $filter + $arrayElemAt evaluates BSON expressions 5 million times. We traded "nested loop probe" for "per-row array scan in BSON space".
This is conceptually similar to the "nested loop with materialization" approach from the previous MongoDB article — reading the lookup collection once, but matching per-document in the projection.
Pipeline-Based $lookup — No Help
Using $lookup with pipeline and let doesn't enhance performance:
The extension still creates a LATERAL join (all code paths set rightTree->lateral = true), and it introduces additional overhead due to variable resolution.
With the MongoDB-compatible API, no solution significantly improves the efficiency of the join. But on DocumentDB, the power of SQL opens new possibilities.
The PostgreSQL escape hatch: SQL with Hash Join
Since DocumentDB stores data in standard PostgreSQL tables, we can query the same collections with SQL—within the same transaction and with full ACID guarantees. The trade-off is that we lose flexible-document join semantics and assume scalar join keys.
Prerequisite: enabling Hash Join for the BSON type (a hack)
The bson type has a hash operator class (bson_hash_ops) used for GROUP BY and DISTINCT. But the =operator doesn't declare hash join support — it's missingHASHES and MERGES properties. This is likely intentional, since bson = bson comparison on full documents has different semantics than field-level equality. But for my investigation (comparing extracted scalar field values), it would work:
-- Requires superuser — this is a hack, not a supported configuration-- If DocumentDB enables this in the future, it will be part of the extensionALTEROPERATORdocumentdb_core.=(documentdb_core.bson,documentdb_core.bson)SET(COMMUTATOR=OPERATOR(documentdb_core.=),HASHES,MERGES);
Without this, PostgreSQL cannot execute hash join for bson = bson conditions, even in custom SQL. However, note that the SQL hash join method, enabled by this hack, does not replicate MongoDB's "any element matches" behavior when joined fields include arrays.
The Query: CTE with Hash Join (~39s)
To utilize a SQL join, I first query the two collections within two common table expressions in the WITH clause, then join them in the main query:
PostgreSQL creates a small 5-row hash table (9 kB) and probes it once per portfolio. It makes a single pass over both collections. Most of the remaining time is spent calling bson_expression_get 5 million times to retrieve the join key, along with bson_dollar_merge_documents_at_path and bson_dollar_project to generate the final output.
In the end, this query is only about twice as fast. It requires a complex workaround, breaks document semantics, and still spends most of its time evaluating BSON expressions.
Results Summary
I started Docker containers as:
# Start MongoDB (minimal setup for ACID transactions)
docker exec-it$(
docker run -d-p 27017:27017 mongo --replSet rs0
) mongosh --eval"rs.initiate()"# Start DocumentDB (PostgreSQL and MongoDB-compatible endpoints)
docker run -d-p 9712:9712 -p 10260:10260 \
ghcr.io/documentdb/documentdb/documentdb-local:latest \--username"franck"--password"franck"--start-pg
Below is a summary of my experiments, run in Docker containers with default configurations, involving 5 million portfolios, 5 fxRates, and a unique index on fxRates.currency:
Approach
MongoDB
DocumentDB
Strategy
$lookup localField/foreignField
~170s
~88s
Nested Loop (lateral index/filter)
$lookup with foreignField: "_id"
~155s
~71s
Nested Loop (index probe)
Uncorrelated $lookup + $filter
~22s
~68s
Materialize once + per-doc filter
SQL CTE + Hash Join (operator tweak)
—
~39s
Hash Join (forced)
HashJoin (SBE, internal tweak)
~14s
—
Hash Join (forced)
MongoDB's native HashJoin via the Slot-Based Execution engine is fastest, handling hash table build/probe natively with per-element array support and avoiding BSON field extraction overhead, but will not be used without configuration tweaks. The DocumentDB SQL escape hatch uses PostgreSQL's optimizer for the same join strategy but incurs overhead with bson_expression_get on each row.
The other solutions are compatible with standard configurations and use appropriate data models and query code. Remember that the time here reflects reading five million documents, and the difference may be insignificant on small datasets.
The Trade-off: Flexibility vs. Optimization
These experiments show the trade-off clearly. Relational systems rely on joins due to normalization, but they can optimize them effectively thanks to typed scalar columns. Document databases avoid many joins, but when joins are needed, flexible semantics—like arrays—limit the available algorithms.
DocumentDB for PostgreSQL sits in the middle. It relies on PostgreSQL storage and execution while preserving MongoDB semantics. As a result, $lookup uses only a subset of SQL's join capabilities to preserve this flexibility. The SQL workaround shows that performance improves when you enforce scalar semantics, but this runs counter to the expectations of a document model, where any field in one document can be an array in another.
So the real question is not which system is faster, but which trade-off you choose: flexibility with embedded arrays or optimization for scalar values.
This was tested on MongoDB 8.0 and DocumentDB 0.112 on PostgreSQL 17.10, and both can improve in the future. Optimization is possible when the field is a known scalar. But if you have a fixed schema, do you still want a document database or switch to SQL? PostgreSQL can also gain optimizations that benefit DocumentDB queries. For example, the lateral join could be memoized in a future version.
If you're thinking about using DocumentDB for PostgreSQL — whether you're migrating from MongoDB or starting fresh — don't stop at the first slow query. Look into the causes, since the trade-off between speed and flexibility can differ. Check execution plans, and file an issue or start a discussion. More feedback from real workloads helps the contributors improve the extension. That's a major advantage of open source.
Running pgvector on Amazon Aurora PostgreSQL gives you a production-grade vector store on a database you already know, backed by the operational tooling, high availability, and scaling behaviour of Amazon Aurora. Production traffic does introduce a predictable set of operational considerations: query latency as the corpus grows, recall on filtered vector searches, memory headroom during index builds, and connection behaviour under load. This post is scoped to the database operations that keep the RAG retrieval layer healthy. In this post, we cover the operational practices that keep a pgvector workload healthy once you depend on it: choosing the right index and distance function, scaling with quantization and partitioning, managing Hierarchical Navigable Small World (HNSW) churn, sizing for memory-resident operation, and the observability signals that catch problems early.
When migrating from Oracle Database to PostgreSQL, the Oracle Foreign Data Wrapper enables PostgreSQL to connect to Oracle and import data as if it were a local table. This extension is not available on all managed PostgreSQL services. For Azure Database for PostgreSQL, setup is straightforward since it runs community PostgreSQL. The only requirement is that extensions must first be enabled via the azure-specific parameter azure.extensions:
Oracle_FDW extension in PostgreSQL
I can create the extension as the administrator and utilize its features:
It's important to understand how PostgreSQL will connect to the Oracle Database: oracle_fdw is installed with the Oracle Instant Client libraries.
An Oracle Database in a lab
I exposed an Oracle database to the internet for lab purposes, so I did not need to configure the network. I set up using the Docker image.
# Start Oracle and expose port 1521 to all interfacessudo docker run -d--name ora -p 1521:1521 -eORACLE_PASSWORD=franck gvenzl/oracle-free:slim
# Open port 1521 in the firewall (you may have to add a rule in the inbound security rules)sudo firewall-cmd --zone=public --permanent--add-port=1521/tcp
sudo firewall-cmd --reload# Starting Oracle Database takes time, wait for it to be upecho"Waiting for Oracle";until docker logs ora |
grep"Completed: Pluggable database FREEPDB1 opened read write"do echo-n".";sleep 1 ;done# Load the SCOTT/TIGER schema example
docker exec-it ora bash -c'
TWO_TASK=//localhost:1521/FREEPDB1 sqlplus system/franck @?/rdbms/admin/utlsampl.sql
'# Allow SCOTT to EXPLAIN PLAN
docker exec-it ora bash -c'
echo "GRANT SELECT_CATALOG_ROLE TO SCOTT;" |
TWO_TASK=//localhost:1521/FREEPDB1 sqlplus system/franck
'# Show the connection string from internetecho"You can connect as: SCOTT/TIGER@//$(curl -s ifconfig.me):1521/FREEPDB1"
Foreign Data Wrapper server and user
Now, back to Azure Database for PostgreSQL, I can specify the Oracle Database server via the connection string, as well as the user and password I use for connection:
When using the VERBOSE option, EXPLAIN executes an EXPLAIN PLAN on the remote server. (I granted SELECT_CATALOG_ROLE to prevent the no SELECT privilege on V$SQL error.)
An EXPLAIN ANALYZE displays the timing of remote calls, which here is approximately 12 milliseconds:
This 12ms latency is expected because my PostgreSQL database is located in Amsterdam (Azure West Europe), while the Oracle Database is in Paris (OCI eu-paris-1).
Network troubleshooting
I can verify the location using the TCP/IP information in the Oracle Database listener.
The IP address embedded in PROGRAM reflects the PostgreSQL client process name, not Oracle network metadata (I used psql in Zurich). The IP in HOST indicates the PostgreSQL client connecting to the Oracle listener, while the first IP from ipinfo.io shows the location of the Oracle Database host.
A latency of 12ms is expected between these two cloud regions:
If I need to troubleshoot, I can run tcpdump -i any port 1521 -nn on the Oracle Database host:
sudo tcpdump -i any port 1521 -nn-tttt-vvv
dropped privs to tcpdump
tcpdump: verbose output suppressed, use -v or -vvfor full protocol decode
listening on any, link-type LINUX_SLL (Linux cooked v1), capture size 262144 bytes
2026-06-24 22:01:44.583852 IP 51.124.207.22.1024 > 10.0.0.59.1521: Flags [S], seq 987172167, win 65535, options [mss 1440,sackOK,TS val 2030652947 ecr 0,nop,wscale 9], length 0
2026-06-24 22:01:44.583952 IP 51.124.207.22.1024 > 172.17.0.2.1521: Flags [S], seq 987172167, win 65535, options [mss 1440,sackOK,TS val 2030652947 ecr 0,nop,wscale 9], length 0
2026-06-24 22:01:44.583960 IP 51.124.207.22.1024 > 172.17.0.2.1521: Flags [S], seq 987172167, win 65535, options [mss 1440,sackOK,TS val 2030652947 ecr 0,nop,wscale 9], length 0
2026-06-24 22:01:44.583988 IP 172.17.0.2.1521 > 51.124.207.22.1024: Flags [S.], seq 944233141, ack 987172168, win 65160, options [mss 1460,sackOK,TS val 2168564814 ecr 2030652947,nop,wscale 7], length 0
2026-06-24 22:01:44.583996 IP 172.17.0.2.1521 > 51.124.207.22.1024: Flags [S.], seq 944233141, ack 987172168, win 65160, options [mss 1460,sackOK,TS val 2168564814 ecr 2030652947,nop,wscale 7], length 0
2026-06-24 22:01:44.584048 IP 10.0.0.59.1521 > 51.124.207.22.1024: Flags [S.], seq 944233141, ack 987172168, win 65160, options [mss 1460,sackOK,TS val 2168564814 ecr 2030652947,nop,wscale 7], length 0
2026-06-24 22:01:44.602184 IP 51.124.207.22.1024 > 10.0.0.59.1521: Flags [.], ack 1, win 128, options [nop,nop,TS val 2030652960 ecr 2168564814], length 0
2026-06-24 22:01:44.602200 IP 51.124.207.22.1024 > 10.0.0.59.1521: Flags [P.], seq 1:75, ack 1, win 128, options [nop,nop,TS val 2030652960 ecr 2168564814], length 74
2026-06-24 22:01:44.602201 IP 51.124.207.22.1024 > 10.0.0.59.1521: Flags [P.], seq 75:336, ack 1, win 128, options [nop,nop,TS val 2030652960 ecr 2168564814], length 261
2026-06-24 22:01:44.602241 IP 51.124.207.22.1024 > 172.17.0.2.1521: Flags [.], ack 1, win 128, options [nop,nop,TS val 2030652960 ecr 2168564814], length 0
2026-06-24 22:01:44.602249 IP 51.124.207.22.1024 > 172.17.0.2.1521: Flags [.], ack 1, win 128, options [nop,nop,TS val 2030652960 ecr 2168564814], length 0
2026-06-24 22:01:44.602252 IP 51.124.207.22.1024 > 172.17.0.2.1521: Flags [P.], seq 1:75, ack 1, win 128, options [nop,nop,TS val 2030652960 ecr 2168564814], length 74
2026-06-24 22:01:44.602253 IP 51.124.207.22.1024 > 172.17.0.2.1521: Flags [P.], seq 1:75, ack 1, win 128, options [nop,nop,TS val 2030652960 ecr 2168564814], length 74
2026-06-24 22:01:44.602254 IP 51.124.207.22.1024 > 172.17.0.2.1521: Flags [P.], seq 75:336, ack 1, win 128, options [nop,nop,TS val 2030652960 ecr 2168564814], length 261
2026-06-24 22:01:44.602255 IP 51.124.207.22.1024 > 172.17.0.2.1521: Flags [P.], seq 75:336, ack 1, win 128, options [nop,nop,TS val 2030652960 ecr 2168564814], length 261
2026-06-24 22:01:44.602295 IP 172.17.0.2.1521 > 51.124.207.22.1024: Flags [.], ack 75, win 509, options [nop,nop,TS val 2168564833 ecr 2030652960], length 0
2026-06-24 22:01:44.602303 IP 172.17.0.2.1521 > 51.124.207.22.1024: Flags [.], ack 75, win 509, options [nop,nop,TS val 2168564833 ecr 2030652960], length 0
2026-06-24 22:01:44.602312 IP 10.0.0.59.1521 > 51.124.207.22.1024: Flags [.], ack 75, win 509, options [nop,nop,TS val 2168564833 ecr 2030652960], length 0
2026-06-24 22:01:44.602435 IP 172.17.0.2.1521 > 51.124.207.22.1024: Flags [.], ack 336, win 507, options [nop,nop,TS val 2168564833 ecr 2030652960], length 0
2026-06-24 22:01:44.602446 IP 172.17.0.2.1521 > 51.124.207.22.1024: Flags [.], ack 336, win 507, options [nop,nop,TS val 2168564833 ecr 2030652960], length 0
2026-06-24 22:01:44.602462 IP 10.0.0.59.1521 > 51.124.207.22.1024: Flags [.], ack 336, win 507, options [nop,nop,TS val 2168564833 ecr 2030652960], length 0
2026-06-24 22:01:44.619442 IP 172.17.0.2.1521 > 51.124.207.22.1024: Flags [P.], seq 1:62, ack 336, win 507, options [nop,nop,TS val 2168564850 ecr 2030652960], length 61
2026-06-24 22:01:44.619470 IP 172.17.0.2.1521 > 51.124.207.22.1024: Flags [P.], seq 1:62, ack 336, win 507, options [nop,nop,TS val 2168564850 ecr 2030652960], length 61
2026-06-24 22:01:44.619493 IP 10.0.0.59.1521 > 51.124.207.22.1024: Flags [P.], seq 1:62, ack 336, win 507, options [nop,nop,TS val 2168564850 ecr 2030652960], length 61
2026-06-24 22:01:44.630863 IP 51.124.207.22.1024 > 10.0.0.59.1521: Flags [.], ack 62, win 128, options [nop,nop,TS val 2030652994 ecr 2168564850], length 0
Duplicate packets originating from 172.17.0.2 are from Docker networking and are captured on both the host and container bridges. This marks the start of a TCP connection from Azure Database for PostgreSQL to Oracle on port 1521, beginning with the TCP handshake. PostgreSQL then sends two TNS packets to Oracle, which acknowledges both. Oracle responds with a 61-byte payload. TCP flags include [S] (connection setup), [S.] (SYN+ACK), [.] (ACK), and [P.] (PSH + ACK). The absence of the U flag indicates Oracle Net isn't using TCP urgent data, resolving previous OOB issues with older clients.
Production connection string
Running the Oracle Database in a lab helps with troubleshooting, but the database you want to migrate is not publicly accessible, and the connection string may be a bit more complex. As the Oracle client is on a managed service, you cannot add files such as tnsnames.ora or an mTLS wallet. I tested with an Oracle Autonomous database.
You need to enable One-Way TLS (encryption without a wallet) by setting Mutual TLS (mTLS) authentication to not required, and add the IP address of the PostgreSQL database to the allow list:
I got this address from the listener log of my previous test.
Then I got the TLS connection string for the Oracle Autonomous database and used it to define the foreign data wrapper server:
After specifying the connection string and credentials, you can read or import tables from Oracle Autonomous into Azure Database for PostgreSQL.
Conclusion
Oracle FDW allows querying an Oracle database from PostgreSQL with minimal setup. On Azure Database for PostgreSQL, the only extra step is to enable the extension. Otherwise, it functions as standard PostgreSQL. This feature is useful for migrations to import or compare data between the two databases. Queries are straightforward, and the planner decides what runs remotely. Performance depends on pushdown to Oracle and network latency. Filters and joins may run remotely or locally, and if pushdown isn’t possible, large result sets may be transferred and joined locally. All this is visible from the execution plan.
Connecting to Oracle can sometimes be challenging. Make sure to run oracle_diag() and verify that both oracle_fdw and the Oracle client are current and compatible with your Oracle database, that the port is opened on the network between the two databases, and that you can connect with a connection string without additional files.
In this post, you learn how to build a Spring Boot REST API that integrates with Aurora DSQL. You’ll configure the Aurora DSQL JDBC Connector for IAM authentication, implement optimistic concurrency control, and run the application across two regional nodes to observe active-active behavior.
Looking at my peak male physique, and my Keanu Reeves baby face, you would never suspect it, but I recently turned 50. As is the tradition, I thought about writing a post titled "50 Lessons at 50". Unfortunately, I don't have that kind of wisdom. The thing is, I still feel like I'm 18, same age as my son. Turns out this is the secret old guys have been hiding from us all along. You get older on the outside, but inside you still see yourself as the same young lad.
Still, fifty years should count for something. So what did I actually learn? How am I different mentally than my 18-year-old self. Here is my attempt to tally it up.
1. Caution is warranted
I finally understand my parents. As you age, you accumulate battle scars, and the scars turn into habits. Anything that can go wrong will go wrong. You forget the cooktop on once, and suddenly you check it three times before leaving the house. True story. You stop diving head-first into a pile of leaves, because there might be a rake underneath. Another true story.
But the trap is overcorrecting. Common sense is good, but too much common sense is dangerous. It makes you overly cautious, predictable, and dull, and it talks you out of the leaf pile even when there is no rake. So, it is important to fight that entropy.
2. Realism helps you understand how the world runs
When I was younger I thought everyone was good, smart, rational, and reasonable. I was idealistic, and I was projecting my own motivations onto everyone else. As a professor, I rarely questioned the administrators. Surely everyone was trying to improve education for the students. Not really. In fact, many of the students were not particularly interested in improving their own education either.
This was a disappointing realization, but then it was also liberating. The world gets much easier to understand once you take off the rose-colored glasses and start watching the incentives instead of the stated intentions. You stop being surprised, and you stop being hurt. You get a model of the world that actually predicts the world. Recognizing where the incentives point doesn't obligate you to like them. You can decide to play the game, or refuse it, or even try to change it.
3. Competence is bliss
I alluded to the Murphy's Law earlier. In 1949, an engineer named Edward Murphy got frustrated after a technician wired the sensors backwards during an Air Force rocket sled experiment. This led to the aphorism "if there is a way to do it wrong, someone will find it". People remember this as a line about bad luck, but it's actually about how easy it is to do things wrong, and how rare it is to do them right.
It took me a couple of decades to fully appreciate that competence is far scarcer than you'd expect. This is both bad news and good news. The bad news is obvious, and all of us are going to experience it many times, at the hands of incompetent parties. The good news is that competence stands out. When you meet genuinely competent people you notice immediately, and you appreciate them. They are reliable. They sweat the details and understand the tradeoffs. You start seeking them out, and when you find them, friendships form fast, because you both know how unusual the thing actually is.
4. Do what you like
I've always wanted an excuse to use the Bell Curve meme, so here it goes.
Most people dramatically overestimate their ability to predict what will matter in ten years or even five years. The elaborate plans are mostly a waste of time. The best strategy is work on what you like, and what you find interesting and energizing, because curiosity, enthusiasm, and the craft compounds. The safest long-term strategy is often to become exceptionally good at something you genuinely enjoy.
"Find out who you are and do it on purpose." --Dolly Parton
5. Your attitude determines your success
I am not likely to turn into the old guy yelling at the cloud any time soon, as I am aware of the danger of learned pessimism and helplessness. So I care about this lesson more than the others. Your attitude determines your success, and the stories you tell yourself matter. Since our emotions drive our persistence and our willingness to keep going, it pays to keep a constructive narrative about setbacks and challenges.
I aim for cautious optimism. I am realistic about the risks, but still optimistic about the possibilities. I am jaded enough to see how the world works, but still idealistic enough to want to improve it. Years ago I wrote a post called Fool Yourself, about deliberately choosing the narrative that gives you energy and momentum. I still believe that. At 50 I'm a little wiser, certainly more cautious and aware of the challenges ahead, but I am still fooling myself into chasing things I find meaningful.
In a previous post, I explained that using filtering with pgvector can decrease recall in approximate nearest neighbor (ANN) searches.
This post repeats the same experiment and dataset but compares two methods:
pgvector with HNSW, a popular PostgreSQL extension
HorizonDB with DiskANN, Microsoft’s vector index
The goal is to understand what happens when similarity search is combined with filtering.
Both setups were tested on a HorizonDB instance with 2 vCores and 16 GiB RAM in Azure public preview, where I activated the extensions by adding them to azure.extensions.
In real applications, SQL statements without a WHERE clause are rare, and typical queries combine a filter alongside a similarity search:
The execution plan indicates that the color = 'red' filter was applied to the scan result, and the distance was computed from the filtered data to sort and return the top-15.
This approach works when the filter is sufficiently selective to enable distance calculations and sorting on a small set of rows. However, with larger or expanding data sets, a quick neighbor search requires an index that can efficiently find neighbors.
With HNSW index - Approximate Nearest Neighbor Index Scan
Once the index is created, the query runs faster, but the result, compared to the previous execution, is incomplete:
The HNSW index retrieves a fixed number of candidates
PostgreSQL applies the WHERE filter afterward
Only matching rows are returned, which can result in fewer rows than requested
For example, if the index examines about 40 candidates (ef_search = 40, the default) and the filter matches one third of the rows, we would expect about 13 matching rows on average. This is approximate because the index examines a local region of the vector space, where the metadata distribution may differ from the global distribution.
In this execution plan, 40 candidates were visited (12 returned, 28 filtered out), but only 12 matched the predicate, falling short of the query's LIMIT 15. These candidates are not necessarily the closest neighbors that satisfy the filter. The true nearest neighbors with color = 'red' might lie outside the explored region and were therefore not considered, because the region included other colors.
As a result, fewer than LIMIT rows may be returned, and even those may not be the actual nearest neighbors within the filtered subset. This explains the drop in recall, which results from post-filtering.
In the execution plan, this is evident as the condition on color appears as a Filter applied after the index scan:
The execution plan details show what happened: 12 returned + 28 filtered = 40 candidates, which is ef_search.
HNSW applies filtering after traversal, meaning the graph is searched globally before rows are discarded. This breaks the assumption that the closest neighbors are among the candidates returned by the index, especially when using selective filters, as they often are not.
The mitigations include:
Increasing ef_search. For instance, based on our data distribution, setting hnsw.ef_search = 200 could provide enough candidates so that at least 15 remain after filtering.
Using partial indexes when filtering on a small set of discrete, known values, so each is indexing a specific subset.
Enabling iterative scans to expand the search scope when filtering removes too many candidates. This improves recall but requires multiple scans.
These measures improve recall, though they come with higher costs and require manual calibration because filtering is handled separately from the ANN search.
DiskANN on HorizonDB
Now I create a DiskANN index, designed for filtering during the search:
In HorizonDB documentation, it is explained that DiskANN enables advanced filtering for combined vector and metadata queries. Depending on selectivity and planner estimates, it can apply filters during traversal or fall back to post-filtering
With DiskANN index - Approximate Nearest Neighbor Index Scan
The note is interesting: pre-filtering was not used. Still, the result is complete because DiskANN dynamically adapts the number of candidates explored.
This means the filter was not applied during traversal, so the query falls back to post-filtering, as with pgvector. However, unlike HNSW, the number of candidates is not fixed: the engine explores more vectors until enough rows are returned.
In the execution plan, we see that 42 rows were filtered out and 15 were returned, meaning about 57 candidates were explored:
By varying the LIMIT, we can see that even without pre-filtering, the results are complete and remain fast for common top-n queries:
LIMIT
Rows Removed by Filter
rows=
Buffers
Execution Time
1
1
1
3779
4.215 ms
2
1
2
3790
6.431 ms
5
22
5
4613
8.153 ms
10
29
10
5310
9.841 ms
15
42
15
5722
10.893 ms
100
222
100
15364
39.989 ms
1000
1879
1000
91585
826.303 ms
10000
20202
10000
602407
27827.608 ms
This distinction is crucial: pgvector uses a fixed search budget (ef_search), whereas DiskANN adjusts the number of explored candidates. Even with post-filtering, it can provide enough rows without manual tuning and without incremental scans.
Highly selective query - Exact Nearest Neighbor
If the filter is highly selective, there is no need to use an approximate search:
The execution plan shows that it didn't use the vector index but instead used the B-tree index on id with an additional Sort for exact nearest-neighbor ordering:
This is where the execution plan displays a Custom Scan (DiskANNFilteredScan), enabled by PostgreSQL's extensibility. It performs pre-filtering (Filter(IndexScan)) and Approximate Nearest Neighbor search (Vector) to yield the precise number of candidates in the filtered set (Rows Retrieved: 15 count):
This plan is an ANN with pre-filtering, effective when both the column filter and the embedding filter are selective and need to be combined for efficient execution. The optimizer makes this decision dynamically.
TIDs Collected indicates the number of rows matching the indexable portion of the filter prior to the vector search. According to the cardinalities, the id condition was used for prefiltering:
This represents a selectivity of about 15% for the pre-filtered set. The remaining condition (color = 'red') is applied afterward, reducing the result further to about 5% of the table.
Why this matters
These experiments reveal a different execution model, not just performance differences. With pgvector and HNSW, search is drive... (truncated)
In 1957, Parkinson postulated his "Law of Triviality" using a fictitious committee reviewing plans for a nuclear power plant. The reactor design gets 10 minutes because nobody understands it, so nobody argues. The bike shed gets 45 minutes because everyone has opinions about the paint color.
I feel like we are living this committee meeting at scale every day.
LLMs are already better engineers than most of us. They are better at formal methods, and better at reasoning under pressure than most people. They run at incredible speed and don't get tired. They improve continuously. But, some people keep moving the goalposts on LLMs. First they said LLMs couldn't code. Then they said they hallucinated too much. All of these barriers fell, but some people are still scoffing at these systems. What chutzpah!
If aliens landed in Central Park tomorrow, I don't think the reaction would be that different. With AI, an alien form of intelligence has already arrived in our laps, and we gave it a collective shrug and kept scrolling.
Psychologists call this the normalcy bias. We tend to assume things will continue roughly as they have, even when confronted with something that requires revising our picture of the future.
Douglas Adams introduces an adjacent phenomena, called the Somebody Else's Problem (SEP) field. The SEP field does not try to make something invisible directly, rather it makes your brain classify it as somebody else's problem, so it gets actively skipped. An alien spacecraft hovers over a crowded park unnoticed, because everyone has silently agreed it's someone else's problem.
The individual adaptation strategy makes sense of course. Learn the tools, and stay ahead of the curve to save your own ship. Why not? Stay afloat the next couple years. But what about the coming decades?
I am not saying everyone should engage with everything. I know about the circles of control, influence, and concern. But, unfortunately, even the people who should be debating this (technologists, researchers, economists, policymakers, ethicists, military planners, educators) are not engaging with this problem at the depth it deserves.
We are having the bike shed debate about the reactor, and unfortunately the reactor design doesn't get enough attention.
This has results from sysbench on a small server with Postgres versions 12 through 19 beta1. Sysbench is run with high concurrency (40 connections) and a cached database. The purpose is to search for changes in performance.
Postgres remains boring, it is hard to find performance regressions.
tl;dr for Postgres 17 to 19
there are no regressions
throughput on the read-only-count test improves by ~3X in 19 beta1 thanks to a better query plan
tl;dr for Postgres 12 to 19
there are few regressions, throughput might have dropped by up to 5% on a few range query tests
there are a few large improvements for read-only tests
there are many large improvements for write-heavy tests
Builds, configuration and hardware
I compiled Postgres from source for versions 12.22, 13.23, 14.23, 15.18, 16.14, 17.10, 18.4 and 19 beta1.
I used a 48-core server from Hetzner
an ax162s with an AMD EPYC 9454P 48-Core Processor with SMT disabled
2 Intel D7-P5520 NVMe storage devices with RAID 1 (3.8T each) using ext4
128G RAM
Ubuntu 24.04
Configuration files for Postgres:
the config file is named conf.diff.cx10a_c32r128 (x10a_c32r128) and is here for versions 12, 13, 14, 15, 16 and 17.
for Postgres 18 and 19 I used conf.diff.cx10b_c32r128(x10b_c32r128) which is as close as possible to the Postgres 17 config and uses io_method=sync
Benchmark
I used sysbench and my usage is explained here. I now run 32 of the 42 microbenchmarks listed in that blog post. Most test only one type of SQL statement. Benchmarks are run with the database cached by Postgres.
The read-heavy microbenchmarks are run for 600 seconds and the write-heavy for 1200 seconds. The benchmark is run with 40 clients and 8 tables with 10M rows per table. The database is cached.
The purpose is to search for regressions from new CPU overhead and mutex contention. I use the small server with low concurrency to find regressions from new CPU overheads and then larger servers with high concurrency to find regressions from new CPU overheads and mutex contention.
The tests can be called microbenchmarks. They are very synthetic. But microbenchmarks also make it easy to understand which types of SQL statements have great or lousy performance. Performance testing benefits from a variety of workloads -- both more and less synthetic.
Results
The microbenchmarks are split into 4 groups -- 1 for point queries, 2 for range queries, 1 for writes. For the range query microbenchmarks, part 1 has queries that don't do aggregation while part 2 has queries that do aggregation.
I provide charts below with relative QPS (rQPS). The relative QPS is the following:
(QPS for some version) / (QPS for base version)
When the relative QPS is > 1 then some version is faster than base version. When it is < 1 then there might be a regression. Values from iostat and vmstat divided by QPS are also provided here. These can help to explain why something is faster or slower because it shows how much HW is used per request.
Here, base version is either Postgres 12.23 or 17.10 and some version is a more recent version. I use 12.23 as the base version to identify regressions over a long period of time. And then I use 17.10 as the base version to confirm there aren't recent, large regressions.
I describe performance changes (changes to relative QPS) in terms of basis points. Performance changes by one basis point when the difference in rQPS is 0.01. When rQPS decreases from 0.95 to 0.85 then it changed by 10 basis points.
Results: point queries, version 17 to 19
Summary:
there are no regressins
Relative to: PG 17.10
col-1 : PG 18.4
col-2 : PG 19 beta1
col-1 col-2
1.00 0.99 hot-points
1.01 1.01 point-query
1.00 1.00 points-covered-pk
0.98 0.99 points-covered-si
1.01 1.00 points-notcovered-pk
1.00 1.00 points-notcovered-si
1.01 1.01 random-points_range=10
1.02 1.00 random-points_range=100
1.00 1.00 random-points_range=1000
Results: point queries, version 12 to 19
Summary
there are no regressions
throughput for the hot-points test improves by ~2X in versions 17.10, 18.4 and 19beta