June 26, 2026
$lookup join strategies: understanding the trade-offs with flexible documents
In a previous post, I explored how MongoDB chooses between nested loop, indexed loop, and hash join strategies for $lookup. Here, I examine what occurs when $lookup runs on DocumentDB for PostgreSQL—an open-source extension implementing the MongoDB API on PostgreSQL.
The document model minimizes the need for joins by embedding related data directly within documents. However, when a join is necessary — such as for reference data that updates independently, many-to-many relationships, or dimensional lookups — the flexibility of embedding can complicate join optimization.
The goal isn't just to identify "which database is faster"—it's to understand why their behaviors differ, the trade-offs involved, and the options when join performance matters.
Relational databases tend to perform more joins because normalized schemas require them, but they also optimize joins more effectively thanks to scalar, well-typed columns. In contrast, document databases perform fewer joins thanks to embedding, but when they do, flexible field semantics—such as arrays—restrict the available join algorithms.
I've run all tests in Docker containers with default settings on the same machine. The timings are indicative, not benchmarks — they illustrate the relative cost of different approaches, not absolute performance under production conditions (caching, concurrency, hardware, and tuning would all change the numbers).
An example: fact and reference table
In a document database, you'd typically embed related data to avoid joins. But some data doesn't embed well:
-
Exchange rates change continuously. If you embed
rate_to_usdinside each portfolio document, you'd need to update millions of documents every time a rate moves. - Portfolios reference a currency, and you want to compute USD valuations by looking up the current rate at query time.
This is a classic case where a $lookup join makes sense: a large fact collection (portfolios) joined to a small, frequently-updated reference collection (fxRates). The document model can't avoid this join without accepting stale embedded rates.
Schema and data generation
I created two collections:
-
portfolios: 5 million documents with acurrencyfield (5 distinct values) -
fxRates: 5 documents mapping each currency to its USD exchange rate
I used mongosh to create and load the collection with the following commands:
db.portfolios.drop();
db.fxRates.drop();
const currencies = ["USD", "EUR", "CHF", "GBP", "JPY"];
currencies.forEach(cur => {
db.fxRates.insertOne({
currency: cur,
rate_to_usd: Math.random() * (1.5 - 0.5) + 0.5,
last_updated: new Date()
});
});
const totalPortfolios = 5e6;
let bulk = [];
for (let i = 1; i <= totalPortfolios; i++) {
const currency = currencies[Math.floor(Math.random() * currencies.length)];
bulk.push({
portfolioId: i,
clientId: Math.floor(Math.random() * 10000),
valuation: Math.round(Math.random() * 1_000_000),
currency: currency,
asOfDate: new Date()
});
if (bulk.length === 10000) {
db.portfolios.insertMany(bulk);
bulk = [];
}
}
if (bulk.length > 0) db.portfolios.insertMany(bulk);
db.fxRates.createIndex({ currency: 1 }, { unique: true });
The index on a five-document collection is not strictly necessary, but it's good practice and protects my lookup table from duplicates.
The Query with $lookup
This query fetches all portfolios, retrieves the foreign exchange rate for each currency, and converts the valuation to USD.
db.portfolios.aggregate([
{$lookup: {
from: "fxRates",
localField: "currency",
foreignField: "currency",
as: "fx"
}},
{$unwind: "$fx"},
{$project: {
portfolioId: 1, valuation: 1, currency: 1,
rate_to_usd: "$fx.rate_to_usd",
valuation_usd: {$multiply: ["$valuation", "$fx.rate_to_usd"]}
}}
])
MongoDB's $lookup combined with $unwind behaves like a LEFT OUTER JOIN followed by filtering out non-matching rows.
Why flexible documents make joins hard
In a relational database, portfolios.currency is a VARCHAR column. The optimizer knows it's a single scalar value per row. It can extract it, hash it, sort it, or probe an index with it — all with well-defined operators.
In a document database, currency might be:
- A string:
"USD" - An array:
["USD", "EUR"] - Missing entirely
- A nested document
MongoDB's $lookup compatibility requires the following behavior:
- If
localFieldis an array["USD", "EUR"], it matches any foreign document whereforeignFieldequals"USD"OR"EUR"(or contains either, if it's also an array). - It's effectively an "any element matches any element" semantic.
This means that the join condition is not always a simple equality a = b, but may involve “any element matches” semantics evaluated at runtime. Instead, the matching logic must evaluate each document's field at runtime, determine whether it's a scalar or an array, and match accordingly.
The safest general approach is a lateral join — executing the inner query for each outer document and passing the current document's field value into the matching function. This is what both MongoDB and DocumentDB for PostgreSQL do.
What happens under the hood (DocumentDB for PostgreSQL)
I use the DocumentDB API in a SQL query rather than the MongoDB-compatible endpoint to view the PostgreSQL execution plan.
EXPLAIN (ANALYZE ON, BUFFERS ON, COSTS ON, VERBOSE ON)
SELECT document
FROM bson_aggregation_pipeline('test',
'{
"aggregate": "portfolios",
"pipeline": [
{
"$lookup": {
"from": "fxRates",
"localField": "currency",
"foreignField": "currency",
"as": "fx"
}
},
{
"$unwind": "$fx"
},
{
"$project": {
"portfolioId": 1,
"valuation": 1,
"currency": 1,
"rate_to_usd": "$fx.rate_to_usd",
"valuation_usd": {
"$multiply": [
"$valuation",
"$fx.rate_to_usd"
]
}
}
}
],
"cursor": {}
}');
Since I joined a large collection with a small one and require all documents from both, I would anticipate a hash join. Instead, it uses a nested loop join:
Nested Loop (actual time=579..64792 rows=5000000 loops=1)
-> Seq Scan on documents_11 collection (rows=5000000 loops=1)
-> Seq Scan on documents_10 collection_0_1 (rows=1 loops=5000000)
Filter: bson_dollar_lookup_join_filter(...)
Rows Removed by Filter: 4
Execution Time: 87750 ms
The fxRates table (5 rows, fitting in a single 8kB block) is scanned 5 million times. PostgreSQL's cost-based optimizer knows the table is tiny and fits in cache, so a sequential scan is the right choice over an index scan — but the scan is still executed 5 million times because of the LATERAL pattern. The filter function bson_dollar_lookup_join_filter is evaluated 25 million times. This function handles array semantics — it extracts the field from the outer document, determines whether it's scalar or an array, and checks for matches in the inner document.
Because the inner side is marked as LATERAL, it depends on the current outer row. This prevents PostgreSQL from evaluating both sides independently, which is required for hash or merge joins. As a result, only a nested loop strategy is possible.
In MongoDB, the equivalent behavior is the IndexedLoopJoin strategy: for each outer document, probe the index on the foreign field. The algorithm and per-document cost are the same.
A note on MongoDB's Hash Join
MongoDB 8.0 can use hash join for $lookup when allowDiskUse: true, no compatible index on the foreign field, the foreign collection is small, and the SBE engine is active. Under these conditions, MongoDB builds an in-memory hash table from the foreign collection, correctly handling array semantics by storing per-element entries.
In tests with 5M portfolios and 5 fxRates, MongoDB's native HashJoin finished in ~14 seconds — the fastest of my tests. Without tweaks, it took 170 seconds — the worst.
To achieve 14 seconds, I dropped the index on the foreign field, enabled allowDiskUse, and set internalQueryFrameworkControl to trySbeEngine. The default trySbeRestricted mode doesn't push the $lookup and $unwind to SBE, since the optimization depends on feature flags that aren't enabled in this mode. With trySbeEngine, SBE handles the pipeline, using HashJoin:
// Setup for hash join
db.adminCommand({setParameter: 1, internalQueryFrameworkControl: "trySbeEngine"});
db.fxRates.dropIndex("currency_1");
// The query (same as all other tests)
db.portfolios.aggregate([
{$lookup: {from: "fxRates", localField: "currency", foreignField: "currency", as: "fx"}},
{$unwind: "$fx"},
{$project: {portfolioId: 1, valuation: 1, currency: 1, rate_to_usd: "$fx.rate_to_usd", valuation_usd: {$multiply: ["$valuation", "$fx.rate_to_usd"]}}}
], {allowDiskUse: true}).explain("executionStats");
// Restore
db.fxRates.createIndex({currency: 1}, {unique: true});
db.adminCommand({setParameter: 1, internalQueryFrameworkControl: "trySbeRestricted"});
DocumentDB for PostgreSQL doesn't currently implement this optimization — it relies on PostgreSQL's native join strategies, which don't understand BSON array semantics. Under normal conditions, both MongoDB and DocumentDB use a Nested Loop join.
Attempting alternatives via the MongoDB API
Using _id as Join Key (~71s)
The documentDB extension has a special case when foreignField is _id — it uses direct object_id equality:
// Reshape fxRates to use currency as _id
db.fxRates.drop();
currencies.forEach(cur => {
db.fxRates.insertOne({
_id: cur,
rate_to_usd: Math.random() * (1.5 - 0.5) + 0.5,
last_updated: new Date()
});
});
db.portfolios.aggregate([
{$lookup: {from: "fxRates", localField: "currency", foreignField: "_id", as: "fx"}},
{$unwind: "$fx"},
{$project: {portfolioId:1, valuation:1, currency:1,
rate_to_usd:"$fx.rate_to_usd",
valuation_usd:{$multiply:["$valuation","$fx.rate_to_usd"]}}}
])
It uses an index scan with the join condition applied as an Index Cond, which is more efficient than a sequential scan with a Filter. It's slightly faster, taking 71 seconds instead of 88 seconds, yet it remains a nested loop with 5 million iterations:
Nested Loop (actual time=17..48170 rows=5000000 loops=1)
-> Seq Scan on documents_11 collection (rows=5000000 loops=1)
-> Index Scan using _id_ on documents_12 (rows=1 loops=5000000)
Index Cond: (object_id = ANY (bson_dollar_lookup_extract_filter_array(...)))
Execution Time: 70578 ms
This is the same as MongoDB's IndexedLoopJoin — the _id field is guaranteed to be scalar, so the extension can use a direct equality lookup on the primary key. However, it doesn't change the join strategy.
Uncorrelated $lookup + $filter (~68s)
A minor enhancement involves reading all fxRates at once, using an empty pipeline and no join condition, attaching the data as an array, and then filtering locally:
db.portfolios.aggregate([
{$lookup: {from: "fxRates", pipeline: [], as: "allFx"}},
{$addFields: {
fx: {$arrayElemAt: [{$filter: {
input: "$allFx", as: "r",
cond: {$eq: ["$$r.currency", "$currency"]}
}}, 0]}
}},
{$project: {portfolioId:1, valuation:1, currency:1,
rate_to_usd:"$fx.rate_to_usd",
valuation_usd:{$multiply:["$valuation","$fx.rate_to_usd"]}}}
])
The execution plan shows a Nested Loop with a single loop:
Nested Loop (actual time=17..20177 rows=5000000 loops=1)
-> Aggregate (rows=1 loops=1) -- reads fxRates ONCE
-> Seq Scan on documents_11 (rows=5000000 loops=1)
Execution Time: 67905 ms (of which ~48s is $addFields+$project)
The join itself is fast — fxRates are aggregated once into a single array. But the per-document $filter + $arrayElemAt evaluates BSON expressions 5 million times. We traded "nested loop probe" for "per-row array scan in BSON space".
This is conceptually similar to the "nested loop with materialization" approach from the previous MongoDB article — reading the lookup collection once, but matching per-document in the projection.
Pipeline-Based $lookup — No Help
Using $lookup with pipeline and let doesn't enhance performance:
{$lookup: {
from: "fxRates",
let: { cur: "$currency" },
pipeline: [
{$match: {$expr: {$eq: ["$currency", "$$cur"]}}}
],
as: "fx"
}},
{$unwind: "$fx"},
The extension still creates a LATERAL join (all code paths set rightTree->lateral = true), and it introduces additional overhead due to variable resolution.
With the MongoDB-compatible API, no solution significantly improves the efficiency of the join. But on DocumentDB, the power of SQL opens new possibilities.
The PostgreSQL escape hatch: SQL with Hash Join
Since DocumentDB stores data in standard PostgreSQL tables, we can query the same collections with SQL—within the same transaction and with full ACID guarantees. The trade-off is that we lose flexible-document join semantics and assume scalar join keys.
Prerequisite: enabling Hash Join for the BSON type (a hack)
The bson type has a hash operator class (bson_hash_ops) used for GROUP BY and DISTINCT. But the = operator doesn't declare hash join support — it's missing HASHES and MERGES properties. This is likely intentional, since bson = bson comparison on full documents has different semantics than field-level equality. But for my investigation (comparing extracted scalar field values), it would work:
-- Requires superuser — this is a hack, not a supported configuration
-- If DocumentDB enables this in the future, it will be part of the extension
ALTER OPERATOR documentdb_core.= (documentdb_core.bson, documentdb_core.bson)
SET (COMMUTATOR = OPERATOR(documentdb_core.=), HASHES, MERGES);
Without this, PostgreSQL cannot execute hash join for bson = bson conditions, even in custom SQL. However, note that the SQL hash join method, enabled by this hack, does not replicate MongoDB's "any element matches" behavior when joined fields include arrays.
The Query: CTE with Hash Join (~39s)
To utilize a SQL join, I first query the two collections within two common table expressions in the WITH clause, then join them in the main query:
WITH portfolios AS (
SELECT document FROM documentdb_api.collection('test', 'portfolios')
),
fxRates AS (
SELECT document FROM documentdb_api.collection('test', 'fxRates')
)
SELECT documentdb_api_internal.bson_dollar_project(
documentdb_api_internal.bson_dollar_merge_documents_at_path(
p.document, f.document, 'fx'),
'{ "portfolioId" : 1, "valuation" : 1, "currency" : 1,
"rate_to_usd" : "$fx.rate_to_usd",
"valuation_usd" : { "$multiply" : ["$valuation", "$fx.rate_to_usd"] } }'::bson,
'{}'::bson
)
FROM portfolios p
JOIN fxRates f
ON documentdb_api_catalog.bson_expression_get(
p.document, '{"": "$currency"}'::bson, true)
= documentdb_api_catalog.bson_expression_get(
f.document, '{"": "$currency"}'::bson, true);
With this query and the operator tweak enabling hash join, I have the following execution plan:
Hash Join (actual time=7.4..34018 rows=5000000 loops=1)
Hash Cond: (bson_expression_get(documents_11.document, '{"":"$currency"}'...)
= bson_expression_get(documents_10.document, '{"":"$currency"}'...))
-> Seq Scan on documents_11 (rows=5000000 loops=1)
-> Hash (rows=5 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 9kB
-> Seq Scan on documents_10 (rows=5 loops=1)
Execution Time: 38664 ms
PostgreSQL creates a small 5-row hash table (9 kB) and probes it once per portfolio. It makes a single pass over both collections. Most of the remaining time is spent calling bson_expression_get 5 million times to retrieve the join key, along with bson_dollar_merge_documents_at_path and bson_dollar_project to generate the final output.
In the end, this query is only about twice as fast. It requires a complex workaround, breaks document semantics, and still spends most of its time evaluating BSON expressions.
Results Summary
I started Docker containers as:
# Start MongoDB (minimal setup for ACID transactions)
docker exec -it $(
docker run -d -p 27017:27017 mongo --replSet rs0
) mongosh --eval "rs.initiate()"
# Start DocumentDB (PostgreSQL and MongoDB-compatible endpoints)
docker run -d -p 9712:9712 -p 10260:10260 \
ghcr.io/documentdb/documentdb/documentdb-local:latest \
--username "franck" --password "franck" --start-pg
Below is a summary of my experiments, run in Docker containers with default configurations, involving 5 million portfolios, 5 fxRates, and a unique index on fxRates.currency:
| Approach | MongoDB | DocumentDB | Strategy |
|---|---|---|---|
$lookup localField/foreignField |
~170s | ~88s | Nested Loop (lateral index/filter) |
$lookup with foreignField: "_id"
|
~155s | ~71s | Nested Loop (index probe) |
Uncorrelated $lookup + $filter
|
~22s | ~68s | Materialize once + per-doc filter |
| SQL CTE + Hash Join (operator tweak) | — | ~39s | Hash Join (forced) |
HashJoin (SBE, internal tweak) |
~14s | — | Hash Join (forced) |
MongoDB's native HashJoin via the Slot-Based Execution engine is fastest, handling hash table build/probe natively with per-element array support and avoiding BSON field extraction overhead, but will not be used without configuration tweaks. The DocumentDB SQL escape hatch uses PostgreSQL's optimizer for the same join strategy but incurs overhead with bson_expression_get on each row.
The other solutions are compatible with standard configurations and use appropriate data models and query code. Remember that the time here reflects reading five million documents, and the difference may be insignificant on small datasets.
The Trade-off: Flexibility vs. Optimization
These experiments show the trade-off clearly. Relational systems rely on joins due to normalization, but they can optimize them effectively thanks to typed scalar columns. Document databases avoid many joins, but when joins are needed, flexible semantics—like arrays—limit the available algorithms.
DocumentDB for PostgreSQL sits in the middle. It relies on PostgreSQL storage and execution while preserving MongoDB semantics. As a result, $lookup uses only a subset of SQL's join capabilities to preserve this flexibility. The SQL workaround shows that performance improves when you enforce scalar semantics, but this runs counter to the expectations of a document model, where any field in one document can be an array in another.
So the real question is not which system is faster, but which trade-off you choose: flexibility with embedded arrays or optimization for scalar values.
This was tested on MongoDB 8.0 and DocumentDB 0.112 on PostgreSQL 17.10, and both can improve in the future. Optimization is possible when the field is a known scalar. But if you have a fixed schema, do you still want a document database or switch to SQL? PostgreSQL can also gain optimizations that benefit DocumentDB queries. For example, the lateral join could be memoized in a future version.
If you're thinking about using DocumentDB for PostgreSQL — whether you're migrating from MongoDB or starting fresh — don't stop at the first slow query. Look into the causes, since the trade-off between speed and flexibility can differ. Check execution plans, and file an issue or start a discussion. More feedback from real workloads helps the contributors improve the extension. That's a major advantage of open source.
June 25, 2026
Running pgvector in production on Amazon Aurora PostgreSQL
Oracle FDW on Azure Database for PostgreSQL
When migrating from Oracle Database to PostgreSQL, the Oracle Foreign Data Wrapper enables PostgreSQL to connect to Oracle and import data as if it were a local table. This extension is not available on all managed PostgreSQL services. For Azure Database for PostgreSQL, setup is straightforward since it runs community PostgreSQL. The only requirement is that extensions must first be enabled via the azure-specific parameter azure.extensions:
Oracle_FDW extension in PostgreSQL
I can create the extension as the administrator and utilize its features:
postgres=> \du
List of roles
Role name | Attributes
----------------+------------------------------------------------------------
azure_pg_admin | Cannot login
azuresu | Superuser, Create role, Create DB, Replication, Bypass RLS
franck | Create role, Create DB, Bypass RLS
replication | Replication
postgres=> \connect - franck
I am now connected to the database "postgres" as user "franck".
postgres=> create extension oracle_fdw;
CREATE EXTENSION
postgres=> select oracle_diag();
oracle_diag
------------------------------------------------------------------------------------------------------------
oracle_fdw 2.8.0, PostgreSQL 18.4, Oracle client 23.26.0.0.0, ORACLE_HOME=/opt/oracle/oracle_instantclient
It's important to understand how PostgreSQL will connect to the Oracle Database: oracle_fdw is installed with the Oracle Instant Client libraries.
An Oracle Database in a lab
I exposed an Oracle database to the internet for lab purposes, so I did not need to configure the network. I set up using the Docker image.
# Start Oracle and expose port 1521 to all interfaces
sudo docker run -d --name ora -p 1521:1521 -e ORACLE_PASSWORD=franck gvenzl/oracle-free:slim
# Open port 1521 in the firewall (you may have to add a rule in the inbound security rules)
sudo firewall-cmd --zone=public --permanent --add-port=1521/tcp
sudo firewall-cmd --reload
# Starting Oracle Database takes time, wait for it to be up
echo "Waiting for Oracle" ; until docker logs ora |
grep "Completed: Pluggable database FREEPDB1 opened read write"
do echo -n "." ; sleep 1 ; done
# Load the SCOTT/TIGER schema example
docker exec -it ora bash -c '
TWO_TASK=//localhost:1521/FREEPDB1 sqlplus system/franck @?/rdbms/admin/utlsampl.sql
'
# Allow SCOTT to EXPLAIN PLAN
docker exec -it ora bash -c '
echo "GRANT SELECT_CATALOG_ROLE TO SCOTT;" |
TWO_TASK=//localhost:1521/FREEPDB1 sqlplus system/franck
'
# Show the connection string from internet
echo "You can connect as: SCOTT/TIGER@//$(curl -s ifconfig.me):1521/FREEPDB1"
Foreign Data Wrapper server and user
Now, back to Azure Database for PostgreSQL, I can specify the Oracle Database server via the connection string, as well as the user and password I use for connection:
postgres=> create server oracle_server
foreign data wrapper oracle_fdw
options (dbserver '//141.145.212.208:1521/FREEPDB1');
CREATE SERVER
postgres=> create user mapping for current_user
server oracle_server
options (user 'SCOTT', password 'TIGER');
CREATE USER MAPPING
I import the tables I need to query:
postgres=> IMPORT FOREIGN SCHEMA "SCOTT"
FROM SERVER oracle_server INTO public
;
postgres=> \d
List of relations
Schema | Name | Type | Owner
--------+----------+---------------+--------
public | bonus | foreign table | franck
public | dept | foreign table | franck
public | emp | foreign table | franck
public | salgrade | foreign table | franck
(4 rows)
postgres=> select * from dept;
deptno | dname | loc
--------+------------+----------
10 | ACCOUNTING | NEW YORK
20 | RESEARCH | DALLAS
30 | SALES | CHICAGO
40 | OPERATIONS | BOSTON
(4 rows)
I can access the tables as if they were local, my PostgreSQL session being connected to the remote Oracle Database.
Performance and execution plan
The execution plan indicates whether the join has been pushed down, a decision made by the query planner:
postgres=> explain (verbose off)
select * from dept join emp using(deptno)
;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Hash Join (cost=30012.50..40190.00 rows=5000 width=206)
Hash Cond: (dept.deptno = emp.deptno)
-> Foreign Scan on dept (cost=10000.00..20000.00 rows=1000 width=92)
Oracle query: SELECT /*62b5e8b0d88d7cfb*/ r1."DEPTNO", r1."DNAME", r1."LOC" FROM "SCOTT"."DEPT" r1
-> Hash (cost=20000.00..20000.00 rows=1000 width=116)
-> Foreign Scan on emp (cost=10000.00..20000.00 rows=1000 width=116)
Oracle query: SELECT /*d6c3774a98386064*/ r2."EMPNO", r2."ENAME", r2."JOB", r2."MGR", r2."HIREDATE", r2."SAL", r2."COMM", r2."DEPTNO" FROM "SCOTT"."EMP" r2
(7 rows)
postgres=> explain (verbose off)
select * from dept join emp using(deptno)
where ename='KING'
;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Foreign Scan (cost=10000.00..10250.00 rows=25 width=206)
Oracle query: SELECT /*7a68726e19f9c7b7*/ r1."DEPTNO", r1."DNAME", r1."LOC", r2."EMPNO", r2."ENAME", r2."JOB", r2."MGR", r2."HIREDATE", r2."SAL", r2."COMM" FROM ("SCOTT"."DEPT" r1 INNER JOIN "SCOTT"."EMP" r2 ON (r1."DEPTNO" = r2."DEPTNO") AND (r2."ENAME" = 'KING'))
(2 rows)
When using the VERBOSE option, EXPLAIN executes an EXPLAIN PLAN on the remote server. (I granted SELECT_CATALOG_ROLE to prevent the no SELECT privilege on V$SQL error.)
postgres=> explain (verbose on, analyze, costs off)
select * from dept join emp using(deptno)
where ename='KING'
;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Foreign Scan (actual time=11.621..11.632 rows=1.00 loops=1)
Output: dept.deptno, dept.dname, dept.loc, emp.empno, emp.ename, emp.job, emp.mgr, emp.hiredate, emp.sal, emp.comm
Oracle query: SELECT /*7a68726e19f9c7b7*/ r1."DEPTNO", r1."DNAME", r1."LOC", r2."EMPNO", r2."ENAME", r2."JOB", r2."MGR", r2."HIREDATE", r2."SAL", r2."COMM" FROM ("SCOTT"."DEPT" r1 INNER JOIN "SCOTT"."EMP" r2 ON (r1."DEPTNO" = r2."DEPTNO") AND (r2."ENAME" = 'KING'))
Oracle plan: SELECT STATEMENT
Oracle plan: HASH JOIN (condition "R1"."DEPTNO"="R2"."DEPTNO")
Oracle plan: NESTED LOOPS
Oracle plan: NESTED LOOPS
Oracle plan: STATISTICS COLLECTOR
Oracle plan: TABLE ACCESS FULL EMP (filter "R2"."ENAME"='KING')
Oracle plan: INDEX UNIQUE SCAN PK_DEPT (condition "R1"."DEPTNO"="R2"."DEPTNO")
Oracle plan: TABLE ACCESS BY INDEX ROWID DEPT
Oracle plan: TABLE ACCESS FULL DEPT
Query Identifier: 3032555283609452836
Planning Time: 35.288 ms
Execution Time: 11.704 ms
(15 rows)
An EXPLAIN ANALYZE displays the timing of remote calls, which here is approximately 12 milliseconds:
This 12ms latency is expected because my PostgreSQL database is located in Amsterdam (Azure West Europe), while the Oracle Database is in Paris (OCI eu-paris-1).
Network troubleshooting
I can verify the location using the TCP/IP information in the Oracle Database listener.
docker exec -it ora bash -c 'curl -s ipinfo.io ; echo ; echo ; grep "PROGRAM=postgres" /opt/oracle/diag/tnslsnr/*/listener/trace/listener.log | tail -1'
{
"ip": "141.145.212.208",
"city": "Paris",
"region": "Île-de-France",
"country": "FR",
"loc": "48.8534,2.3488",
"org": "AS31898 Oracle Corporation",
"postal": "75000",
"timezone": "Europe/Paris",
"readme": "https://ipinfo.io/missingauth"
}
24-JUN-2026 21:28:23:647 * (CONNECT_DATA=(SERVICE_NAME=FREEPDB1)(CID=(PROGRAM=postgres:?franck?postgres?152.67.90.237?35830??SELECT)(HOST=fea1c65ae6cf)(USER=azuredb))(CONNECTION_ID=VQdZFJxbFlngYwQBIQob2g==))(TARGET_LOCAL_INSTANCE=FREE) * (ADDRESS=(PROTOCOL=tcp)(HOST=51.124.207.22)(PORT=1024)) * establish * FREEPDB1 * 0
The IP address embedded in PROGRAM reflects the PostgreSQL client process name, not Oracle network metadata (I used psql in Zurich). The IP in HOST indicates the PostgreSQL client connecting to the Oracle listener, while the first IP from ipinfo.io shows the location of the Oracle Database host.
A latency of 12ms is expected between these two cloud regions:
If I need to troubleshoot, I can run tcpdump -i any port 1521 -nn on the Oracle Database host:
sudo tcpdump -i any port 1521 -nn -tttt -vvv
dropped privs to tcpdump
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on any, link-type LINUX_SLL (Linux cooked v1), capture size 262144 bytes
2026-06-24 22:01:44.583852 IP 51.124.207.22.1024 > 10.0.0.59.1521: Flags [S], seq 987172167, win 65535, options [mss 1440,sackOK,TS val 2030652947 ecr 0,nop,wscale 9], length 0
2026-06-24 22:01:44.583952 IP 51.124.207.22.1024 > 172.17.0.2.1521: Flags [S], seq 987172167, win 65535, options [mss 1440,sackOK,TS val 2030652947 ecr 0,nop,wscale 9], length 0
2026-06-24 22:01:44.583960 IP 51.124.207.22.1024 > 172.17.0.2.1521: Flags [S], seq 987172167, win 65535, options [mss 1440,sackOK,TS val 2030652947 ecr 0,nop,wscale 9], length 0
2026-06-24 22:01:44.583988 IP 172.17.0.2.1521 > 51.124.207.22.1024: Flags [S.], seq 944233141, ack 987172168, win 65160, options [mss 1460,sackOK,TS val 2168564814 ecr 2030652947,nop,wscale 7], length 0
2026-06-24 22:01:44.583996 IP 172.17.0.2.1521 > 51.124.207.22.1024: Flags [S.], seq 944233141, ack 987172168, win 65160, options [mss 1460,sackOK,TS val 2168564814 ecr 2030652947,nop,wscale 7], length 0
2026-06-24 22:01:44.584048 IP 10.0.0.59.1521 > 51.124.207.22.1024: Flags [S.], seq 944233141, ack 987172168, win 65160, options [mss 1460,sackOK,TS val 2168564814 ecr 2030652947,nop,wscale 7], length 0
2026-06-24 22:01:44.602184 IP 51.124.207.22.1024 > 10.0.0.59.1521: Flags [.], ack 1, win 128, options [nop,nop,TS val 2030652960 ecr 2168564814], length 0
2026-06-24 22:01:44.602200 IP 51.124.207.22.1024 > 10.0.0.59.1521: Flags [P.], seq 1:75, ack 1, win 128, options [nop,nop,TS val 2030652960 ecr 2168564814], length 74
2026-06-24 22:01:44.602201 IP 51.124.207.22.1024 > 10.0.0.59.1521: Flags [P.], seq 75:336, ack 1, win 128, options [nop,nop,TS val 2030652960 ecr 2168564814], length 261
2026-06-24 22:01:44.602241 IP 51.124.207.22.1024 > 172.17.0.2.1521: Flags [.], ack 1, win 128, options [nop,nop,TS val 2030652960 ecr 2168564814], length 0
2026-06-24 22:01:44.602249 IP 51.124.207.22.1024 > 172.17.0.2.1521: Flags [.], ack 1, win 128, options [nop,nop,TS val 2030652960 ecr 2168564814], length 0
2026-06-24 22:01:44.602252 IP 51.124.207.22.1024 > 172.17.0.2.1521: Flags [P.], seq 1:75, ack 1, win 128, options [nop,nop,TS val 2030652960 ecr 2168564814], length 74
2026-06-24 22:01:44.602253 IP 51.124.207.22.1024 > 172.17.0.2.1521: Flags [P.], seq 1:75, ack 1, win 128, options [nop,nop,TS val 2030652960 ecr 2168564814], length 74
2026-06-24 22:01:44.602254 IP 51.124.207.22.1024 > 172.17.0.2.1521: Flags [P.], seq 75:336, ack 1, win 128, options [nop,nop,TS val 2030652960 ecr 2168564814], length 261
2026-06-24 22:01:44.602255 IP 51.124.207.22.1024 > 172.17.0.2.1521: Flags [P.], seq 75:336, ack 1, win 128, options [nop,nop,TS val 2030652960 ecr 2168564814], length 261
2026-06-24 22:01:44.602295 IP 172.17.0.2.1521 > 51.124.207.22.1024: Flags [.], ack 75, win 509, options [nop,nop,TS val 2168564833 ecr 2030652960], length 0
2026-06-24 22:01:44.602303 IP 172.17.0.2.1521 > 51.124.207.22.1024: Flags [.], ack 75, win 509, options [nop,nop,TS val 2168564833 ecr 2030652960], length 0
2026-06-24 22:01:44.602312 IP 10.0.0.59.1521 > 51.124.207.22.1024: Flags [.], ack 75, win 509, options [nop,nop,TS val 2168564833 ecr 2030652960], length 0
2026-06-24 22:01:44.602435 IP 172.17.0.2.1521 > 51.124.207.22.1024: Flags [.], ack 336, win 507, options [nop,nop,TS val 2168564833 ecr 2030652960], length 0
2026-06-24 22:01:44.602446 IP 172.17.0.2.1521 > 51.124.207.22.1024: Flags [.], ack 336, win 507, options [nop,nop,TS val 2168564833 ecr 2030652960], length 0
2026-06-24 22:01:44.602462 IP 10.0.0.59.1521 > 51.124.207.22.1024: Flags [.], ack 336, win 507, options [nop,nop,TS val 2168564833 ecr 2030652960], length 0
2026-06-24 22:01:44.619442 IP 172.17.0.2.1521 > 51.124.207.22.1024: Flags [P.], seq 1:62, ack 336, win 507, options [nop,nop,TS val 2168564850 ecr 2030652960], length 61
2026-06-24 22:01:44.619470 IP 172.17.0.2.1521 > 51.124.207.22.1024: Flags [P.], seq 1:62, ack 336, win 507, options [nop,nop,TS val 2168564850 ecr 2030652960], length 61
2026-06-24 22:01:44.619493 IP 10.0.0.59.1521 > 51.124.207.22.1024: Flags [P.], seq 1:62, ack 336, win 507, options [nop,nop,TS val 2168564850 ecr 2030652960], length 61
2026-06-24 22:01:44.630863 IP 51.124.207.22.1024 > 10.0.0.59.1521: Flags [.], ack 62, win 128, options [nop,nop,TS val 2030652994 ecr 2168564850], length 0
Duplicate packets originating from 172.17.0.2 are from Docker networking and are captured on both the host and container bridges. This marks the start of a TCP connection from Azure Database for PostgreSQL to Oracle on port 1521, beginning with the TCP handshake. PostgreSQL then sends two TNS packets to Oracle, which acknowledges both. Oracle responds with a 61-byte payload. TCP flags include [S] (connection setup), [S.] (SYN+ACK), [.] (ACK), and [P.] (PSH + ACK). The absence of the U flag indicates Oracle Net isn't using TCP urgent data, resolving previous OOB issues with older clients.
Production connection string
Running the Oracle Database in a lab helps with troubleshooting, but the database you want to migrate is not publicly accessible, and the connection string may be a bit more complex. As the Oracle client is on a managed service, you cannot add files such as tnsnames.ora or an mTLS wallet. I tested with an Oracle Autonomous database.
You need to enable One-Way TLS (encryption without a wallet) by setting Mutual TLS (mTLS) authentication to not required, and add the IP address of the PostgreSQL database to the allow list:
I got this address from the listener log of my previous test.
Then I got the TLS connection string for the Oracle Autonomous database and used it to define the foreign data wrapper server:
create server oracle_autonomous
foreign data wrapper oracle_fdw
options (dbserver '(description= (retry_count=20)(retry_delay=3)(address=(protocol=tcps)(port=1521)(host=adb.eu-madrid-1.oraclecloud.com))(connect_data=(service_name=g230b6cc64a62e6_mad_tp.adb.oraclecloud.com))(security=(ssl_server_dn_match=yes)))');
create user mapping for current_user
server oracle_autonomous
options (user 'ADMIN', password '4ut0n0m0u5_Password');
After specifying the connection string and credentials, you can read or import tables from Oracle Autonomous into Azure Database for PostgreSQL.
Conclusion
Oracle FDW allows querying an Oracle database from PostgreSQL with minimal setup. On Azure Database for PostgreSQL, the only extra step is to enable the extension. Otherwise, it functions as standard PostgreSQL. This feature is useful for migrations to import or compare data between the two databases. Queries are straightforward, and the planner decides what runs remotely. Performance depends on pushdown to Oracle and network latency. Filters and joins may run remotely or locally, and if pushdown isn’t possible, large result sets may be transferred and joined locally. All this is visible from the execution plan.
Connecting to Oracle can sometimes be challenging. Make sure to run oracle_diag() and verify that both oracle_fdw and the Oracle client are current and compatible with your Oracle database, that the port is opened on the network between the two databases, and that you can connect with a connection string without additional files.
One Postgres cluster, many apps
June 24, 2026
Build a Spring Boot REST API with Amazon Aurora DSQL
5 Lessons at 50
Looking at my peak male physique, and my Keanu Reeves baby face, you would never suspect it, but I recently turned 50. As is the tradition, I thought about writing a post titled "50 Lessons at 50". Unfortunately, I don't have that kind of wisdom. The thing is, I still feel like I'm 18, same age as my son. Turns out this is the secret old guys have been hiding from us all along. You get older on the outside, but inside you still see yourself as the same young lad.
Still, fifty years should count for something. So what did I actually learn? How am I different mentally than my 18-year-old self. Here is my attempt to tally it up.
1. Caution is warranted
I finally understand my parents. As you age, you accumulate battle scars, and the scars turn into habits. Anything that can go wrong will go wrong. You forget the cooktop on once, and suddenly you check it three times before leaving the house. True story. You stop diving head-first into a pile of leaves, because there might be a rake underneath. Another true story.
But the trap is overcorrecting. Common sense is good, but too much common sense is dangerous. It makes you overly cautious, predictable, and dull, and it talks you out of the leaf pile even when there is no rake. So, it is important to fight that entropy.
2. Realism helps you understand how the world runs
When I was younger I thought everyone was good, smart, rational, and reasonable. I was idealistic, and I was projecting my own motivations onto everyone else. As a professor, I rarely questioned the administrators. Surely everyone was trying to improve education for the students. Not really. In fact, many of the students were not particularly interested in improving their own education either.
This was a disappointing realization, but then it was also liberating. The world gets much easier to understand once you take off the rose-colored glasses and start watching the incentives instead of the stated intentions. You stop being surprised, and you stop being hurt. You get a model of the world that actually predicts the world. Recognizing where the incentives point doesn't obligate you to like them. You can decide to play the game, or refuse it, or even try to change it.
3. Competence is bliss
I alluded to the Murphy's Law earlier. In 1949, an engineer named Edward Murphy got frustrated after a technician wired the sensors backwards during an Air Force rocket sled experiment. This led to the aphorism "if there is a way to do it wrong, someone will find it". People remember this as a line about bad luck, but it's actually about how easy it is to do things wrong, and how rare it is to do them right.
It took me a couple of decades to fully appreciate that competence is far scarcer than you'd expect. This is both bad news and good news. The bad news is obvious, and all of us are going to experience it many times, at the hands of incompetent parties. The good news is that competence stands out. When you meet genuinely competent people you notice immediately, and you appreciate them. They are reliable. They sweat the details and understand the tradeoffs. You start seeking them out, and when you find them, friendships form fast, because you both know how unusual the thing actually is.
4. Do what you like
I've always wanted an excuse to use the Bell Curve meme, so here it goes.
Most people dramatically overestimate their ability to predict what will matter in ten years or even five years. The elaborate plans are mostly a waste of time. The best strategy is work on what you like, and what you find interesting and energizing, because curiosity, enthusiasm, and the craft compounds. The safest long-term strategy is often to become exceptionally good at something you genuinely enjoy.
"Find out who you are and do it on purpose." --Dolly Parton
5. Your attitude determines your success
I am not likely to turn into the old guy yelling at the cloud any time soon, as I am aware of the danger of learned pessimism and helplessness. So I care about this lesson more than the others. Your attitude determines your success, and the stories you tell yourself matter. Since our emotions drive our persistence and our willingness to keep going, it pays to keep a constructive narrative about setbacks and challenges.
I aim for cautious optimism. I am realistic about the risks, but still optimistic about the possibilities. I am jaded enough to see how the world works, but still idealistic enough to want to improve it. Years ago I wrote a post called Fool Yourself, about deliberately choosing the narrative that gives you energy and momentum. I still believe that. At 50 I'm a little wiser, certainly more cautious and aware of the challenges ahead, but I am still fooling myself into chasing things I find meaningful.
June 22, 2026
Vector Search with Filters: pgvector vs DiskANN on HorizonDB
In a previous post, I explained that using filtering with pgvector can decrease recall in approximate nearest neighbor (ANN) searches.
This post repeats the same experiment and dataset but compares two methods:
- pgvector with HNSW, a popular PostgreSQL extension
- HorizonDB with DiskANN, Microsoft’s vector index
The goal is to understand what happens when similarity search is combined with filtering.
Both setups were tested on a HorizonDB instance with 2 vCores and 16 GiB RAM in Azure public preview, where I activated the extensions by adding them to azure.extensions.
In real applications, SQL statements without a WHERE clause are rare, and typical queries combine a filter alongside a similarity search:
SELECT *
FROM embeddings
WHERE tenant_id = 42
ORDER BY embedding <=> :vector_query
LIMIT 10;
You don’t search the entire dataset. Instead, you focus on a specific subset, such as one tenant, recent data, or a category.
Experiment Setup
As in the previous post, I inserted 2 million rows of 512‑dimensional vectors with a simple metadata column (color, evenly distributed):
CREATE TABLE embeddings_table (
id bigserial PRIMARY KEY,
color text,
embedding vector(512)
);
I generated random vectors to insert the embeddings, and three values for color:
create function random_embedding(dimensions int) returns vector as
$$
select
array(
select random()::real
from generate_series(1, dimensions)
)::vector
$$ language sql;
insert into embeddings_table (embedding,color)
select random_embedding(512)
,(array['red', 'green', 'blue'])[generate_series(1, 2000000)%3+1]
;
This uniform distribution is intentional: it isolates the effect of filtering.
pgvector (HNSW)
Here's the HNSW index that I created in my previous post:
CREATE EXTENSION IF NOT EXISTS vector;
CREATE INDEX ON embeddings_table
USING hnsw (embedding vector_cosine_ops)
;
It can take some time. Here is the query I use to monitor progress:
SELECT
command,
phase,
CASE WHEN blocks_total > 0 THEN round(100.0 * blocks_done / blocks_total, 1) ELSE 0 END AS blocks_pct,
CASE WHEN tuples_total > 0 THEN round(100.0 * tuples_done / tuples_total, 1) ELSE 0 END AS tuples_pct,
relid::regclass AS table_name,
index_relid::regclass AS index_name
FROM pg_stat_progress_create_index
;
I might have raised the maintenance_work_mem as indicated by the NOTICE after index creation — it took more than one hour.
NOTICE: hnsw graph no longer fits into maintenance_work_mem after 195123 tuples
CREATE INDEX
Total execution time: 01:20:03.861
I generate another embedding for my query, which I'll store as a psql variable with \gset:
select random_embedding(512) as query
\gset
Without index - Seq Scan and Exact Nearest Neighbor Search
I checked the behavior before the index is created, without using a vector index:
--EXPLAIN (ANALYZE, BUFFERS, VERBOSE off, COSTS off)
SELECT id , color, embedding <=> :'query' nn_cosine
FROM embeddings_table
WHERE color = 'red'
ORDER BY embedding <=> :'query'
LIMIT 15
;
Without a vector index, pgvector performs an exact nearest neighbors (ENN) search:
postgres=> SELECT id , color, embedding <=> :'query' nn_cosine
FROM embeddings_table
WHERE color = 'red'
ORDER BY embedding <=> :'query'
LIMIT 15
;
id | color | nn_cosine
---------+-------+---------------------
1370302 | red | 0.2639440793770883
1048027 | red | 0.23483715071295053
804529 | red | 0.24173628441472828
1013860 | red | 0.25945607630683887
1011571 | red | 0.23552944123421315
1514713 | red | 0.2324248794732996
1411876 | red | 0.21627272961247324
1743091 | red | 0.25835400937319386
494461 | red | 0.23541082932461943
2128600 | red | 0.2531862056756401
2118649 | red | 0.27055299026445434
2151097 | red | 0.25835806753276225
1988278 | red | 0.2341769026974998
1259950 | red | 0.21806901991480443
1553203 | red | 0.24358174125683563
(15 rows)
The execution plan indicates that the color = 'red' filter was applied to the scan result, and the distance was computed from the filtered data to sort and return the top-15.
Limit (actual time=4088.254..4164.226 rows=15 loops=1)
Buffers: shared hit=5348519
-> Gather Merge (actual time=4088.252..4164.222 rows=15 loops=1)
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=5348519
-> Sort (actual time=4058.895..4058.897 rows=11 loops=3)
Sort Key: ((embedding <=> '[0.51098114,0.8720485,0.8568168,...,0.110820755]'::vector))
Sort Method: top-N heapsort Memory: 26kB
Buffers: shared hit=5348519
Worker 0: Sort Method: top-N heapsort Memory: 26kB
Worker 1: Sort Method: top-N heapsort Memory: 26kB
-> Parallel Seq Scan on embeddings_table (actual time=0.127..3949.246 rows=222222 loops=3)
Filter: (color = 'red'::text)
Rows Removed by Filter: 444445
Buffers: shared hit=5348445
Planning Time: 0.078 ms
Execution Time: 4164.252 ms
(18 rows)
This approach works when the filter is sufficiently selective to enable distance calculations and sorting on a small set of rows. However, with larger or expanding data sets, a quick neighbor search requires an index that can efficiently find neighbors.
With HNSW index - Approximate Nearest Neighbor Index Scan
Once the index is created, the query runs faster, but the result, compared to the previous execution, is incomplete:
postgres=> SELECT id , color, embedding <=> :'query' nn_cosine
FROM embeddings_table
WHERE color = 'red'
ORDER BY embedding <=> :'query'
LIMIT 15
;
id | color | nn_cosine
---------+-------+---------------------
1259950 | red | 0.21806901991480443
744973 | red | 0.21565428393281894
550210 | red | 0.2085433809181183
1860280 | red | 0.24081963623331726
1794061 | red | 0.22253304456085543
1056052 | red | 0.232418043420476
1247836 | red | 0.22905966001837208
1128376 | red | 0.24222950580938118
847906 | red | 0.24276414827559645
1782067 | red | 0.233356563767256
2023984 | red | 0.24263831933162294
1438981 | red | 0.21360709574533177
(12 rows)
This is what happened:
- The HNSW index retrieves a fixed number of candidates
- PostgreSQL applies the WHERE filter afterward
- Only matching rows are returned, which can result in fewer rows than requested
For example, if the index examines about 40 candidates (ef_search = 40, the default) and the filter matches one third of the rows, we would expect about 13 matching rows on average. This is approximate because the index examines a local region of the vector space, where the metadata distribution may differ from the global distribution.
In this execution plan, 40 candidates were visited (12 returned, 28 filtered out), but only 12 matched the predicate, falling short of the query's LIMIT 15. These candidates are not necessarily the closest neighbors that satisfy the filter. The true nearest neighbors with color = 'red' might lie outside the explored region and were therefore not considered, because the region included other colors.
As a result, fewer than LIMIT rows may be returned, and even those may not be the actual nearest neighbors within the filtered subset. This explains the drop in recall, which results from post-filtering.
In the execution plan, this is evident as the condition on color appears as a Filter applied after the index scan:
Limit (actual time=2.352..2.496 rows=12 loops=1)
Buffers: shared hit=1775
-> Index Scan using i_hnsw on embeddings_table (actual time=2.351..2.493 rows=12 loops=1)
Order By: (embedding <=> '[0.51098114,0.8720485,0.8568168,...,0.110820755]'::vector)
Filter: (color = 'red'::text)
Rows Removed by Filter: 28
Buffers: shared hit=1775
Planning:
Buffers: shared hit=1
Planning Time: 0.054 ms
Execution
The execution plan details show what happened: 12 returned + 28 filtered = 40 candidates, which is ef_search.
HNSW applies filtering after traversal, meaning the graph is searched globally before rows are discarded. This breaks the assumption that the closest neighbors are among the candidates returned by the index, especially when using selective filters, as they often are not.
The mitigations include:
- Increasing
ef_search. For instance, based on our data distribution, settinghnsw.ef_search = 200could provide enough candidates so that at least 15 remain after filtering. - Using partial indexes when filtering on a small set of discrete, known values, so each is indexing a specific subset.
- Enabling iterative scans to expand the search scope when filtering removes too many candidates. This improves recall but requires multiple scans.
These measures improve recall, though they come with higher costs and require manual calibration because filtering is handled separately from the ANN search.
DiskANN on HorizonDB
Now I create a DiskANN index, designed for filtering during the search:
postgres=> CREATE EXTENSION IF NOT EXISTS pg_diskann;
CREATE EXTENSION
postgres=> CREATE INDEX i_diskann ON embeddings_table
USING diskann (embedding vector_cosine_ops)
;
CREATE INDEX
Total execution time: 00:42:40.800
In HorizonDB documentation, it is explained that DiskANN enables advanced filtering for combined vector and metadata queries. Depending on selectivity and planner estimates, it can apply filters during traversal or fall back to post-filtering
With DiskANN index - Approximate Nearest Neighbor Index Scan
After creating the index, I run the same query:
postgres=> SELECT id , color, embedding <=> :'query' nn_cosine
FROM embeddings_table
WHERE color = 'red'
ORDER BY embedding <=> :'query'
LIMIT 15
;
NOTICE: pg_diskann: Filter selectivity too high (0.3325), skipping filtered vector scan
id | color | nn_cosine
---------+-------+---------------------
1968337 | red | 0.24740531343440575
545257 | red | 0.22958854707242105
828484 | red | 0.24054936837179086
744973 | red | 0.21565428393281894
307819 | red | 0.23955190885002464
557251 | red | 0.2158213914722017
1860280 | red | 0.24081963623331726
1794061 | red | 0.22253304456085543
1048324 | red | 0.24138402291179228
494461 | red | 0.23541082932461943
851149 | red | 0.26081572380948004
832813 | red | 0.2240000332741019
745516 | red | 0.24130828891293765
233791 | red | 0.2324244758968873
1247836 | red | 0.22905966001837208
(15 rows)
The note is interesting: pre-filtering was not used. Still, the result is complete because DiskANN dynamically adapts the number of candidates explored.
This means the filter was not applied during traversal, so the query falls back to post-filtering, as with pgvector. However, unlike HNSW, the number of candidates is not fixed: the engine explores more vectors until enough rows are returned.
In the execution plan, we see that 42 rows were filtered out and 15 were returned, meaning about 57 candidates were explored:
Limit (actual time=6.607..10.821 rows=15 loops=1)
Buffers: shared hit=5772
-> Index Scan using i_diskann on embeddings_table (actual time=6.606..10.817 rows=15 loops=1)
Order By: (embedding <=> '[0.51098114,0.8720485,0.8568168,0.3119626,...,0.110820755]'::vector)
Filter: (color = 'red'::text)
Rows Removed by Filter: 42
Buffers: shared hit=5772
Planning:
Buffers: shared hit=2
Planning Time: 0.104 ms
Execution Time: 10.893 ms
(11 rows)
By varying the LIMIT, we can see that even without pre-filtering, the results are complete and remain fast for common top-n queries:
| LIMIT | Rows Removed by Filter | rows= | Buffers | Execution Time |
|---|---|---|---|---|
| 1 | 1 | 1 | 3779 | 4.215 ms |
| 2 | 1 | 2 | 3790 | 6.431 ms |
| 5 | 22 | 5 | 4613 | 8.153 ms |
| 10 | 29 | 10 | 5310 | 9.841 ms |
| 15 | 42 | 15 | 5722 | 10.893 ms |
| 100 | 222 | 100 | 15364 | 39.989 ms |
| 1000 | 1879 | 1000 | 91585 | 826.303 ms |
| 10000 | 20202 | 10000 | 602407 | 27827.608 ms |
This distinction is crucial: pgvector uses a fixed search budget (ef_search), whereas DiskANN adjusts the number of explored candidates. Even with post-filtering, it can provide enough rows without manual tuning and without incremental scans.
Highly selective query - Exact Nearest Neighbor
If the filter is highly selective, there is no need to use an approximate search:
postgres=> SELECT id , color, embedding <=> :'query' nn_cosine
FROM embeddings_table
WHERE color = 'red' and id = 494461
ORDER BY embedding <=> :'query'
LIMIT 15
;
NOTICE: pg_diskann: Filter selectivity too low (0.0000), using filter-only scan
id | color | nn_cosine
--------+-------+---------------------
494461 | red | 0.23541082932461943
(1 row)
The execution plan shows that it didn't use the vector index but instead used the B-tree index on id with an additional Sort for exact nearest-neighbor ordering:
Limit (actual time=0.038..0.039 rows=1 loops=1)
Buffers: shared hit=12
-> Sort (actual time=0.037..0.038 rows=1 loops=1)
Sort Key: ((embedding <=> '[0.51098114,0.8720485,0.8568168,0.311962,...,0.110820755]'::vector))
Sort Method: quicksort Memory: 25kB
Buffers: shared hit=12
-> Index Scan using embeddings_table_pkey on embeddings_table (actual time=0.033..0.034 rows=1 loops=1)
Index Cond: (id = 494461)
Filter: (color = 'red'::text)
Buffers: shared hit=12
Planning:
Buffers: shared hit=2
Planning Time: 0.140 ms
Execution Time: 0.053 ms
(14 rows)
DiskANN doesn’t even participate when the filtering is highly selective. The planner uses a B-tree index and bypasses ANN.
With DiskANN index - Pre-filtered Approximate Nearest Neighbor
When filters are selective enough to be applied during the scan but not enough to avoid ANN search, there's no NOTICE:
postgres=> SELECT id , color, embedding <=> :'query' nn_cosine
FROM embeddings_table
WHERE color = 'red' and id <= 494461
ORDER BY embedding <=> :'query'
LIMIT 15
;
id | color | nn_cosine
--------+-------+---------------------
418054 | red | 0.21522295998337881
233791 | red | 0.2324244758968873
377470 | red | 0.23508242287803505
196009 | red | 0.2356772244494152
225253 | red | 0.24119501226879858
233026 | red | 0.23055577367239388
308998 | red | 0.21652541614338927
291283 | red | 0.2319492505072145
459442 | red | 0.2448930593940325
296956 | red | 0.21969539585363562
223876 | red | 0.24664282569002072
385669 | red | 0.21339531139470636
390397 | red | 0.2484855378229266
296467 | red | 0.22426404565684388
220285 | red | 0.2179154018451408
(15 rows)
This is where the execution plan displays a Custom Scan (DiskANNFilteredScan), enabled by PostgreSQL's extensibility. It performs pre-filtering (Filter(IndexScan)) and Approximate Nearest Neighbor search (Vector) to yield the precise number of candidates in the filtered set (Rows Retrieved: 15 count):
QUERY PLAN
--------------------------------------------------------------------------------------
Limit (actual time=25.842..28.876 rows=15 loops=1)
Buffers: shared hit=11905
-> Custom Scan (DiskANNFilteredScan) (actual time=25.841..28.872 rows=15 loops=1)
Strategy: Filter(IndexScan) -> Vector
Rows Retrieved: 15 count
TIDs Collected: 303062 count
Buffers: shared hit=11905
Planning:
Buffers: shared hit=2
Planning Time: 0.091 ms
Execution Time: 28.927 ms
(11 rows)
This plan is an ANN with pre-filtering, effective when both the column filter and the embedding filter are selective and need to be combined for efficient execution. The optimizer makes this decision dynamically.
TIDs Collected indicates the number of rows matching the indexable portion of the filter prior to the vector search. According to the cardinalities, the id condition was used for prefiltering:
postgres=> select count(*) FROM embeddings_table
WHERE id <= 494461;
count
--------
303063
This represents a selectivity of about 15% for the pre-filtered set. The remaining condition (color = 'red') is applied afterward, reducing the result further to about 5% of the table.
Why this matters
These experiments reveal a different execution model, not just performance differences. With pgvector and HNSW, search is drive... (truncated)
June 20, 2026
Our Collective Bike Shed Moment
In 1957, Parkinson postulated his "Law of Triviality" using a fictitious committee reviewing plans for a nuclear power plant. The reactor design gets 10 minutes because nobody understands it, so nobody argues. The bike shed gets 45 minutes because everyone has opinions about the paint color.
I feel like we are living this committee meeting at scale every day.
LLMs are already better engineers than most of us. They are better at formal methods, and better at reasoning under pressure than most people. They run at incredible speed and don't get tired. They improve continuously. But, some people keep moving the goalposts on LLMs. First they said LLMs couldn't code. Then they said they hallucinated too much. All of these barriers fell, but some people are still scoffing at these systems. What chutzpah!
If aliens landed in Central Park tomorrow, I don't think the reaction would be that different. With AI, an alien form of intelligence has already arrived in our laps, and we gave it a collective shrug and kept scrolling.
Psychologists call this the normalcy bias. We tend to assume things will continue roughly as they have, even when confronted with something that requires revising our picture of the future.
Douglas Adams introduces an adjacent phenomena, called the Somebody Else's Problem (SEP) field. The SEP field does not try to make something invisible directly, rather it makes your brain classify it as somebody else's problem, so it gets actively skipped. An alien spacecraft hovers over a crowded park unnoticed, because everyone has silently agreed it's someone else's problem.
The individual adaptation strategy makes sense of course. Learn the tools, and stay ahead of the curve to save your own ship. Why not? Stay afloat the next couple years. But what about the coming decades?
I am not saying everyone should engage with everything. I know about the circles of control, influence, and concern. But, unfortunately, even the people who should be debating this (technologists, researchers, economists, policymakers, ethicists, military planners, educators) are not engaging with this problem at the depth it deserves.
We are having the bike shed debate about the reactor, and unfortunately the reactor design doesn't get enough attention.
CPU-bound sysbench on a large server: Postgres 12 to 19 beta1
This has results from sysbench on a small server with Postgres versions 12 through 19 beta1. Sysbench is run with high concurrency (40 connections) and a cached database. The purpose is to search for changes in performance.
Postgres remains boring, it is hard to find performance regressions.
tl;dr for Postgres 17 to 19
- there are no regressions
- throughput on the read-only-count test improves by ~3X in 19 beta1 thanks to a better query plan
tl;dr for Postgres 12 to 19
- there are few regressions, throughput might have dropped by up to 5% on a few range query tests
- there are a few large improvements for read-only tests
- there are many large improvements for write-heavy tests
- an ax162s with an AMD EPYC 9454P 48-Core Processor with SMT disabled
- 2 Intel D7-P5520 NVMe storage devices with RAID 1 (3.8T each) using ext4
- 128G RAM
- Ubuntu 24.04
The read-heavy microbenchmarks are run for 600 seconds and the write-heavy for 1200 seconds. The benchmark is run with 40 clients and 8 tables with 10M rows per table. The database is cached.
I provide charts below with relative QPS (rQPS). The relative QPS is the following:
(QPS for some version) / (QPS for base version)
I describe performance changes (changes to relative QPS) in terms of basis points. Performance changes by one basis point when the difference in rQPS is 0.01. When rQPS decreases from 0.95 to 0.85 then it changed by 10 basis points.
- there are no regressins
- there are no regressions
- throughput for the hot-points test improves by ~2X in versions 17.10, 18.4 and 19beta
- there are no regressions
- while 19 beta1 has a better result on the scan test, that test has more variance with Postgres so I am reluctant to judge this without more results
- there are no regressions
- scan throughput has improved a lot from version 12 to 19
- there are no regressions
- throughput on the read-only-count test is ~3X better thanks to a new query plan. This improvement was also visible on my small server
- there might be a few small regressions, but losing 5% throughput from version 12 to 19 isn't a big deal
- throughput on the read-only-count test is ~3X better thanks to a new query plan. This improvement was also visible on my small server
- there are no regressions
- there are no regressions
- many large improvements arrived in version 17 and remain in 19 beta1
June 17, 2026
PostgreSQL 18 on Amazon Aurora and Amazon RDS: Performance enhancements
PostgreSQL 18 on Amazon Aurora and Amazon RDS: Security, monitoring, and developer enhancements
HorizonDB cache hierarchy: RAM, NVMe SSD, and multi-AZ storage behind PostgreSQL
HorizonDB extends PostgreSQL with disaggregated storage, making it appear to developers like PostgreSQL while offering cloud-native high availability, elasticity, and performance. Performance metrics, especially I/O, can reveal differences due to additional components such as local caches and remote storage along the read/write paths.
Three aspects can expose these internal workings at the PostgreSQL layer:
Cache utilization, observable via EXPLAIN BUFFERS, comparing shared buffer hits with read calls to retrieve pages.
Average read times from EXPLAIN or metrics help infer the storage layer: less than a microsecond for memory, tens of microseconds for local NVMe SSD cache, and about a millisecond for remote persistent storage.
Wait events, as I/O operations handled outside PostgreSQL's main code path, in extensions, may show different patterns in pg_stat_activity compared to traditional PostgreSQL setups.
This approach is similar to tests I previously performed on OCI and AWS. I insert 100 MB of data (~12,800 rows) into a PostgreSQL table every two minutes and run full table scans with EXPLAIN ANALYZE:
\timing on
-- create the table
drop table if exists large;
create table large ( filler text ) with (fillfactor=10);
-- insert 100 MB
insert into large
select string_agg( chr((32+random()*94)::int) ,'' )
from generate_series(1,1000) c, generate_series(1,12800) r
group by r
\;
-- check the size
select 'size', pg_table_size('large') , pg_size_pretty( pg_table_size('large') ) , now()
\;
-- query with full table scan
explain (costs off, analyze, buffers) select * from large
-- repeat every two minutes
\watch i=120 c=1000
I log the execution plans, including the elapsed time for the sequential scan and the number of buffers accessed (shared buffer hits and reads). Some metrics are gathered in Azure, and I also run the PostgreSQL for VS Code extension dashboard, which shows other metrics exposed by PostgreSQL.
It starts with 100% shared buffer hits
The storage grows by 100 MB every two minutes as new inserts are added:
The inserted rows remain constant, but as the table size grows, the read workload increases because more rows are scanned during each execution:
Since this involves a full table scan and the table size grows, it reads an increasing number of blocks, all resulting in shared buffer hits without reading from files, as the inserted pages remain in shared buffers:
The sequential scan in PostgreSQL employs a small ring buffer, but this occurs only on cache misses. Since the pages are already present in the shared buffer from prior insertions, this does not apply here.
This continues, and the table size reaches the shared buffer size.
The inserts fill up the shared buffer (11 GiB)
I run on a HorizonDB instance with 16 GiB RAM, provisioned like in the previous post, and 11 GiB are allocated to the shared buffer:
postgres=> \dconfig shared*
List of configuration parameters
Parameter | Value
----------------------------------+------------------------------------------------------------------------------
shared_buffers | 11241MB
shared_memory_size | 11776MB
shared_memory_size_in_huge_pages | 5888
shared_memory_type | mmap
Until I added 11 GB, the sequential scan only involved shared hit buffers and did not perform any reads (1433600 x 8KB blocks equals 11,200 MB):
Tue 16 Jun 2026 10:33:01 AM GMT (every 120s)
?column? | pg_table_size | pg_size_pretty | now
----------+---------------+----------------+-------------------------------
size | 11747319808 | 11 GB | 2026-06-16 10:33:01.555657+00
(1 row)
Tue 16 Jun 2026 10:33:01 AM GMT (every 120s)
QUERY PLAN
---------------------------------------------------------------------
Seq Scan on large (actual time=0.005..507.198 rows=1433600 loops=1)
Buffers: shared hit=1433600 dirtied=12800
Planning Time: 0.013 ms
Execution Time: 551.265 ms
(4 rows)
Time: 4724.370 ms (00:04.724)
The next run, with 100 MB more data, begins to show some buffers read (16071 x 8KB = 125MB):
Tue 16 Jun 2026 10:35:01 AM GMT (every 120s)
?column? | pg_table_size | pg_size_pretty | now
----------+---------------+----------------+-------------------------------
size | 11852201984 | 11 GB | 2026-06-16 10:35:01.555654+00
(1 row)
This is consistent with the newly appended pages, 100 MB, displacing the same amount of older buffers once shared buffers were full.
Tue 16 Jun 2026 10:35:01 AM GMT (every 120s)
QUERY PLAN
---------------------------------------------------------------------
Seq Scan on large (actual time=0.007..624.128 rows=1446400 loops=1)
Buffers: shared hit=1430329 read=16071 dirtied=12800 written=753
Planning Time: 0.022 ms
Execution Time: 669.210 ms
(4 rows)
Time: 4667.335 ms (00:04.667)
In vanilla PostgreSQL, sequential scans use a bulk-read strategy with a small ring of shared buffers, typically 256KB, to avoid flooding the main shared-buffer cache. With buffered I/O, pages not found in shared buffers may still be served from the operating system page cache, which is the usual double-buffering behavior. In such an experiment, the read (cache misses) increase by 100 MB each time due to buffer evictions from newly inserted data. I observed that on other managed services, asymptotic throughput decayed while cache evictions increased:
In HorizonDB, where the caching layers are different, we do not observe the same gradual decay. After the table scan size exceeded the shared-buffer size, subsequent scans were mostly reported as reads by PostgreSQL, indicating that most older pages were no longer in shared buffers, while only the most recently inserted 100 MB remained as hits. The next run reads the full table via disk reads (1446401 x 8KB = 11300 MB), with only the last 100 MB served from the shared buffer (12799 x 8KB = 100 MB):
Tue 16 Jun 2026 10:37:01 AM GMT (every 120s)
?column? | pg_table_size | pg_size_pretty | now
----------+---------------+----------------+-------------------------------
size | 11957084160 | 11 GB | 2026-06-16 10:37:01.556217+00
(1 row)
Tue 16 Jun 2026 10:37:01 AM GMT (every 120s)
QUERY PLAN
-----------------------------------------------------------------------
Seq Scan on large (actual time=2.825..82165.728 rows=1459200 loops=1)
Buffers: shared hit=12799 read=1446401 dirtied=12949
Planning Time: 0.019 ms
Execution Time: 82219.965 ms
(4 rows)
Time: 86399.819 ms (01:26.400)
PostgreSQL BUFFERS only shows the shared-buffer view. A read is a miss from shared buffers, but in a disaggregated architecture, it may still be served by a local cache before reaching durable remote storage.
Let's do some maths to estimate the storage layer of these buffer reads, considering that most of the per-buffer processing time is dominated by access latency:
- Shared buffer hits: reading 11 GB in 507ms results in roughly 0.00038 ms per I/O (~0.38 µs), indicative of RAM access.
- Buffer reads: reading 11 GB in 1 minute and 26 seconds translates to 134 MB/s, approximately 16,000 IOPS with 8KB I/Os. Since the query is single-threaded and has no parallelism, the average I/O time is about 0.0568 ms (~57 µs). This reflects reading from the local NVMe SSD cache.
I've highlighted the execution plans at the key inflection point. We can review those metrics throughout the entire run using the VS Code dashboard.
Because this is the only table in the database, the total size increases by 100 MB every two minutes, reaching 11 GB:
As table scans increase, shared buffer hits or reads also rise. When the table size hits 11 GB, all reads are served by storage (read, no longer from shared buffers), replacing the earlier 100% cache hits.
The cache hit ratio shows when it falls below 100%:
The query execution time increases from 500ms with RAM to 1 minute with an SSD:
Here is another EXPLAIN showing when those metrics were taken:
Tue 16 Jun 2026 10:55:01 AM GMT (every 120s)
?column? | pg_table_size | pg_size_pretty | now
----------+---------------+----------------+-------------------------------
size | 12901031936 | 12 GB | 2026-06-16 10:55:01.555546+00
(1 row)
Tue 16 Jun 2026 10:55:01 AM GMT (every 120s)
QUERY PLAN
-----------------------------------------------------------------------
Seq Scan on large (actual time=0.114..71725.275 rows=1574400 loops=1)
Buffers: shared hit=12800 read=1561600 dirtied=12800
Planning Time: 0.022 ms
Execution Time: 71781.188 ms
(4 rows)
Time: 76089.679 ms (01:16.090)
Reading 12 GB in 1 minute and 12 seconds results in approximately 166 MB/s, corresponding to around 20,000 IOPS with 8 KB I/O operations and an average of 0.046 ms per I/O (~46 microseconds). This aligns with reads served from the local NVMe SSD cache rather than the remote durable storage path.
Even if these reads are not counted as cache hits by PostgreSQL metrics—which only track the shared buffers—they still constitute a highly effective local cache, achieving 10 GB reads in just one minute. This provides faster I/O without requiring larger compute instances.
Next cache level: local NVMe SSD
While the run was ongoing, I observed the query duration:
There were some peaks in response time that temporarily reduced throughput in inserted and fetched rows:
The same is visible from the block reads:
I gathered the execution plans from that period, showing a single run at 11:21 UTC (13:21 CET on the dashboard), with reduced throughput—reading 11 GB over 5 minutes instead of 1 minute.
Tue 16 Jun 2026 11:17:01 AM GMT (every 120s)
QUERY PLAN
-----------------------------------------------------------------------
Seq Scan on large (actual time=0.111..77440.938 rows=1715200 loops=1)
Buffers: shared hit=14286 read=1700914 dirtied=12800 written=12555
Planning Time: 0.022 ms
Execution Time: 77503.429 ms
(4 rows)
Time: 81571.846 ms (01:21.572)
INSERT 0 12800
Tue 16 Jun 2026 11:19:01 AM GMT (every 120s)
?column? | pg_table_size | pg_size_pretty | now
----------+---------------+----------------+-----------------------------
size | 14159659008 | 13 GB | 2026-06-16 11:19:01.5558+00
(1 row)
Tue 16 Jun 2026 11:19:01 AM GMT (every 120s)
QUERY PLAN
-----------------------------------------------------------------------
Seq Scan on large (actual time=0.109..62897.716 rows=1728000 loops=1)
Buffers: shared hit=345599 read=1382401
Planning Time: 0.022 ms
Execution Time: 62958.979 ms
(4 rows)
Time: 66659.325 ms (01:06.659)
INSERT 0 12800
Tue 16 Jun 2026 11:21:01 AM GMT (every 120s)
?column? | pg_table_size | pg_size_pretty | now
----------+---------------+----------------+-------------------------------
size | 14264598528 | 13 GB | 2026-06-16 11:21:01.555743+00
(1 row)
Tue 16 Jun 2026 11:21:01 AM GMT (every 120s)
QUERY PLAN
------------------------------------------------------------------------
Seq Scan on large (actual time=4.852..288105.348 rows=1740800 loops=1)
Buffers: shared hit=313478 read=1427322 dirtied=12800 written=22539
Planning:
Buffers: shared hit=2
Planning Time: 0.029 ms
Execution Time: 288179.656 ms
(6 rows)
Time: 292242.942 ms (04:52.243)
INSERT 0 12800
Tue 16 Jun 2026 11:25:53 AM GMT (every 120s)
?column? | pg_table_size | pg_size_pretty | now
----------+---------------+----------------+-------------------------------
size | 14369480704 | 13 GB | 2026-06-16 11:25:53.798728+00
(1 row)
Tue 16 Jun 2026 11:25:53 AM GMT (every 120s)
QUERY PLAN
------------------------------------------------------------------------
Seq Scan on large (actual time=30.632..87584.215 rows=1753600 loops=1)
Buffers: shared hit=14238 read=1739362 dirtied=12800 written=11000
Planning Time: 0.015 ms
Execution Time: 87647.389 ms
(4 rows)
Time: 91315.980 ms (01:31.316)
Doing the maths, 1,427,322 reads in 288,105 milliseconds yields an average of roughly 288,105 ms / 1,427,322 ≈ 0.202 ms per read. This indicates that the scan is no longer fully served by the fastest local SSD-cache path, and some reads fall through to a slower storage path.
I reviewed the HorizonDB compute instance metrics in the Azure portal. The Azure portal “Storage” metric shown here does not track the growing table size nor the apparent local read-cache behavior in this experiment, as it remains steady at 120 MB:
The inserted and returned tuples align with our observations. Variations are expected due to the bursty workload: the query executes for about a minute, then pauses for two minutes, causing the 10-second metric buckets to capture the active rate during execution and near-zero during idle times.
The pattern shifts when the query time matches the job's 2-minute wait time. Here is the query time:
The memory usage has been rising until the shared buffers reached their allocated size (Linux allocates lazily):
Finally, it is interesting to observe the network I/O from the compute instance, which relates to the read and write operations to the storage:
The slower run can be explained by some reads taking a lower cache tier or being delayed by contention/throttling. The correlation with network activity makes remote storage involvement plausible, possibly due to SSD cache misses.
Returning to the VS Code dashboard, we can also observe the statistics about vacuum, checkpoint, and WAL:
On the dashboard, the average active-session timeline is visible, along with wait-event names for reads. These suggest that reads are handled by an extension rather than the PostgreSQL core code, as they appear as HorizonDB_SSDCache_Read wait events under the Extension wait type.
Some observations of HorizonDB_Storage_Read indicate longer read times and increased network activity.
Conclusion
These experiments show that HorizonDB behaves like PostgreSQL at the SQL layer while exposing the effects of a multi-tier storage architecture through latency, throughput, and wait events.
With a monotonically increasing table size scanned, we have observed the following phases:
| Phase | PostgreSQL view | Effective per 8 KiB block | HorizonDB reality |
|---|---|---|---|
| Fits in shared buffers | shared hit |
~0.35–0.38 µs | PostgreSQL shared-buffer residency |
| After shared-buffer overflow, normal runs | mostly read
|
~46–57 µs | Local low-latency cache below PostgreSQL |
| “Slower” peaks | mostly read
|
~0.20 ms | Slower storage path, including remote-storage reads |
The configuration (large shared_buffers and effective_cache_size equal to it) shows that HorizonDB minimizes reliance on the OS page cache and instead uses a multi-tier caching strategy, with a fast NVMe SSD in the stateless compute node, to reduce reads from durable remote storage.
From PostgreSQL’s point of view, the scan is still a normal sequential scan, and BUFFERS still reports only the PostgreSQL shared-buffer state. When the table fits in shared buffers, the scan is served as shared hits and completes in a few hundred milliseconds. Once the table grows beyond shared buffers, PostgreSQL reports mostly read buffers. In a disaggregated storage architecture, those reads do not necessarily mean cold remote storage. They may be served from a lower cache tier beneath PostgreSQL.
The effective per-block timings show three distinct regimes in this test: sub-microsecond effective access times when pages are already in shared buffers, tens of microseconds per PostgreSQL block during normal post-overflow scans, and occasional slower periods of a few tenths of a millisecond per block. These slower periods correlate with longer query times and increased network activity, which makes lower-tier reads or storage-path delays plausible, although this experiment alone cannot distinguish cache misses from contention, throttling, or other synchronization effects.
The important lesson is that PostgreSQL cache-hit ratio alone is not enough to understand performance in a disaggregated system. A drop from shared hit to read may look dramatic in PostgreSQL metrics, but it means “not in shared buffers”, not necessarily “read from remote durable storage” in HorizonDB. The observed latency and wait events indicate an additional cache tier below PostgreSQL, where normal reads are served from a fast local NVMe SSD cache, while slower peaks use a slower storage path.
HorizonDB is in preview on Azure, so those metrics may improve, and your feedback is welcome.
Deep dive into Amazon Aurora PostgreSQL lock analysis with CloudWatch Database Insights
Security advisory: CVE-2026-9740 and CVE-2026-11933 in Percona Server for MongoDB
TL;DR: This advisory covers the two most important high-severity memory-safety vulnerabilities affecting MongoDB Community and our downstream Percona Server for MongoDB – CVE-2026-11933 and CVE-2026-9740. Both will be addressed in a single coordinated patch release, bundled with other recently revealed lower-scored CVE fixes: CVE-2026-9753, CVE-2026-9752, CVE-2026-9751, CVE-2026-9750, CVE-2026-9749, CVE-2026-9748, CVE-2026-9747, CVE-2026-9746, CVE-2026-9743, and CVE-2026-9741. Fixes land … Continued
The post Security advisory: CVE-2026-9740 and CVE-2026-11933 in Percona Server for MongoDB appeared first on Percona.
Fuzzy String Search for MySQL
The insert benchmark on a small server, IO-bound workload : Postgres 19 beta1
This has results for Postgres versions 19 beta1, 18.4 and 17.10 with the Insert Benchmark on a small server using a cached and CPU-bound workload. I also used MySQL 8.4.8 to see where performance was different.
Postgres continues to be boring in a good way. It is hard to find performance regressions.
tl;dr
- create index (the l.x step) is faster in Postgres 19beta1. A Postgres expert told me that the sort algorithm was changed to be more CPU efficient
- the write heavy steps (l.i1, l.i2) are 15% and 9% faster in 19 beta1 vs Postgres 17.10
- the second write heavy step (l.i2) is more than 20X faster in MySQL 8.4.8 vs Postgres thanks to the CPU overhead from get_actual_variable_range. I have written about this before.
I compiled MySQL 8.4.8 from source as well.
The server is an Beelink SER7 with a Ryzen 7 7840HS CPU with 8 cores and AMD SMT disabled, 32G of RAM. Storage is one SSD for the OS and an NVMe SSD for the database using ext-4 with discard enabled. The OS is Ubuntu 24.04.
- l.i0
- insert 800M rows per table in PK order. The table has a PK index but no secondary indexes. There is one connection per client.
- l.x
- create 3 secondary indexes per table. There is one connection per client.
- l.i1
- use 2 connections/client. One inserts 4M rows per table and the other does deletes at the same rate as the inserts. Each transaction modifies 50 rows (big transactions). This step is run for a fixed number of inserts, so the run time varies depending on the insert rate.
- l.i2
- like l.i1 but each transaction modifies 5 rows (small transactions) and 1M rows are inserted and deleted per table.
- Wait for S seconds after the step finishes to reduce variance during the read-write benchmark steps that follow. The value of S is a function of the table size.
- qr100
- use 3 connections/client. One does range queries and performance is reported for this. The second does does 100 inserts/s and the third does 100 deletes/s. The second and third are less busy than the first. The range queries use covering secondary indexes. If the target insert rate is not sustained then that is considered to be an SLA failure. If the target insert rate is sustained then the step does the same number of inserts for all systems tested. This step is frequently not IO-bound for the IO-bound workload.
- qp100
- like qr100 except uses point queries on the PK index
- qr500
- like qr100 but the insert and delete rates are increased from 100/s to 500/s
- qp500
- like qp100 but the insert and delete rates are increased from 100/s to 500/s
- qr1000
- like qr100 but the insert and delete rates are increased from 100/s to 1000/s
- qp1000
- like qp100 but the insert and delete rates are increased from 100/s to 1000/s
(QPS for my version / QPS for Postgres 17.10)
The background in the table cells is blue for big improvements and yellow for regressions. There are no regressions here.
- there are large improvements in 19 beta1 (15% and 9%). The CPU overhead is lower in 19 beta1 compared to 17.10 (see cpupq here).
- throughput for the l.i2 step is more than 20X larger for MySQL than for Postgres. From vmstat I see that the CPU overhead (cpupq here) is more than 10X larger with Postgres vs MySQL. From flamegraphs the problem is the CPU overhead in get_actual_variable_range. I have written about this before (see here). The Postgres query planner uses too much CPU skipping old versions to figure out selectivity for a query and there are too many old versions because Postgres doesn't collect them ASAP, vacuum takes time. The flamegraphs are in subdirectories here.
| dbms | l.i0 | l.x | l.i1 | l.i2 | qr100 | qp100 | qr500 | qp500 | qr1000 | qp1000 |
|---|---|---|---|---|---|---|---|---|---|---|
| PG 17.10 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| PG 18.4 | 1.01 | 1.03 | 1.00 | 1.00 | 0.98 | 1.00 | 0.99 | 0.99 | 0.99 | 0.99 |
| PG 19 beta1 | 1.01 | 1.15 | 1.05 | 1.09 | 0.97 | 1.01 | 0.96 | 1.01 | 0.97 | 1.00 |
| MySQL 8.4.8 | 0.77 | 0.89 | 0.76 | 21.62 | 0.61 | 1.07 | 0.66 | 0.93 | 0.85 | 0.84 |
June 16, 2026
Our Italy trip
We spent 10 days in Italy in early April. It was a lot of fun.
Back when I was a grad student, I stayed in Pisa for a month for a "summer school on mobile computing" in 2003. Mobile computing was the next "big" thing back then. (In retrospect, the research was misguided about the local/distributed approach to it, as most of mobile computing reconciled to cloud backends as that is often more efficient.) When I was in Pisa, I didn't travel around much, but I now realize I should have spent every weekend traveling. (In my defense, without Internet enabling trip planning/execution, and without GPS on the phone, traveling was very cumbersome, yes, back in 2003.)
Rome
Rome is amazing! The history is incredibly well preserved. Kudos to the Italians.
Rome is very walkable. The buses, on the other hand, were packed solid. My daughters were genuinely surprised by this exotic new form of transportation that does not exist in the US.
The art in the Vatican was exquisite. All the masters, all in one place. But the lines were horrendous.
Tiramisu is truly wonderful. But (please don't stone me for this) I don't like the coffee. I had promised myself I would drink coffee everywhere, and I tried. But the coffee, even the espresso, just didn't feel that good to me. It was mostly lukewarm. I've gotten used to hot coffee, so I quickly gave up my quest to find my coffee high in Italy.
As I broke the news to you in this blog post (a post about Jensen Huang, of all things), I got pickpocketed at the Trevi Fountain.
"Here's the strange part. After the initial shock and the credit-card cancellations, I found myself thinking: this person was unbelievably good at their job. I felt nothing. No bump, no distraction, nothing. The wallet disappeared from the deep front pocket of my jeans like a magic trick. Respect. I remember thinking: if only I were that good at my own work."
Venice
Venice is simply beautiful. The buildings, the canals, the whole scene were beautiful. The old town doesn't even allow motorcycles, so everything is foot-powered. Everything is well preserved.
Lake Como
Lake Como looked out of this world. It was trippy.
But there were lines everywhere. The ferries ran very inefficiently, and 2-hour wait times were the norm. And this wasn't even peak season. Italians are not the best at organizing/running things. In Italy there are lines everywhere: cafes, restaurants, ferries. Everywhere. A good chunk of my time in Italy was spent waiting in lines.
Turin
Turin is bigger and more metropolitan. It is very clean. I didn't get to see much of it, unfortunately.
The night before our early flight back to the States, we stayed near the Milan airport, in a little town called Somma Lombardo. Even that town was beautiful, with buildings dating from the 1500s, all in remarkably well-preserved shape.
So, what happened to Italy?
Italy was very active at the start of the tech revolution. Olivetti could have been the center of gravity for the personal computer revolution, you know. And then there are these magnificent brands from Italy, including Ferrari, Lamborghini, Fiat, Alfa Romeo. So why don't we see any more of that innovation?
Improve query performance with EXPLAIN plans in Amazon Aurora DSQL
Extending pt-archiver with a Partition-Aware Plug-in for Fast Retention Policy Enforcement
Managing data retention policies is one of the most common operational tasks in MySQL. Applications continuously generate transactional, audit, logging, telemetry, and event data. Over time, these tables can grow to billions of rows, causing: Larger backups Longer recovery times Reduced buffer pool efficiency Slower index maintenance Increased storage costs Degraded query performance To address … Continued
The post Extending pt-archiver with a Partition-Aware Plug-in for Fast Retention Policy Enforcement appeared first on Percona.