December 24, 2025
Adaptive Join in Amazon Aurora PostgreSQL
Slow joins in PostgreSQL often result from a nested loop join chosen by the query planner, which estimates a few rows but processes hundreds of thousands. System metrics such as the buffer cache hit ratio look green, yet the query reads far more data than necessary and takes longer. This can happen suddenly because the join method decision is sensitive: a single-row difference in an estimate can trigger a shift from a hash join to a nested loop.
Although cost-based optimizers are common, alternative strategies exist. For example, MongoDB's multi-planner postpones index selection until execution, testing all options and switching to the best after a short trial. Likewise, Oracle Database can delay selecting the join method or parallel query distribution by buffering rows before determining the plan for the remaining data. Amazon Aurora implements a similar approach called adaptive join, which defers the decision between nested loop and hash join until execution.
Here is an example of Amazon Aurora adaptive plans. I used the same tables as in the previous post with two additional indexes:
CREATE INDEX ON outer_table (a,b,id);
CREATE INDEX ON inner_table (id,b);
I executed the following query:
explain (analyze)
SELECT o.b,i.b
FROM outer_table o
JOIN inner_table i USING(id)
WHERE o.a<10 AND o.b<10
;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------
Nested Loop (cost=0.57..124.76 rows=4 width=8) (actual time=0.020..0.119 rows=9 loops=1)
-> Index Only Scan using outer_table_a_b_id_idx on outer_table o (cost=0.29..58.71 rows=20 width=8) (actual time=0.011..0.061 rows=28 loops=1)
Index Cond: ((a < 10) AND (b < 10))
Heap Fetches: 0
-> Index Only Scan using inner_table_id_b_idx on inner_table i (cost=0.29..3.30 rows=1 width=8) (actual time=0.002..0.002 rows=0 loops=28)
Index Cond: (id = o.id)
Heap Fetches: 0
Planning Time: 0.321 ms
Execution Time: 0.152 ms
(9 rows)
Because I had optimal indexes and not too many rows in the outer table (estimated rows=20), the query planner chose a nested loop join. During execution, there were more rows than estimated (actual rows=28). It might still be an effective join method with not too many inner loops (loops=28). However, what happens if the actual number of rows is much higher?
For instance, increasing the range on "o.b" causes the query planner to swap the join inputs and select a hash join.
explain (analyze)
SELECT o.b,i.b
FROM outer_table o
JOIN inner_table i USING(id)
WHERE o.a<10 AND o.b<1000
;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------
Hash Join (cost=108.48..289.73 rows=444 width=8) (actual time=0.656..2.426 rows=900 loops=1)
Hash Cond: (i.id = o.id)
-> Seq Scan on inner_table i (cost=0.00..155.00 rows=10000 width=8) (actual time=0.006..0.697 rows=10000 loops=1)
-> Hash (cost=80.72..80.72 rows=2221 width=8) (actual time=0.624..0.624 rows=2250 loops=1)
Buckets: 4096 Batches: 1 Memory Usage: 120kB
-> Index Only Scan using outer_table_a_b_id_idx on outer_table o (cost=0.29..80.72 rows=2221 width=8) (actual time=0.019..0.315 rows=2250 loops=1)
Index Cond: ((a < 10) AND (b < 1000))
Heap Fetches: 0
Planning Time: 0.901 ms
Execution Time: 2.522 ms
(10 rows)
Instead of starting a nested loop from "outer_table," this plan loads the entire "outer_table" result into a hash table (the build side) and probes it while scanning "inner_table." Although building the hash table takes longer up front, it avoids running 2,250 inner loops, as I can verify by disabling all other join methods.
set enable_hashjoin to off;
set enable_mergejoin to off;
explain (analyze)
SELECT o.b,i.b
FROM outer_table o
JOIN inner_table i USING(id)
WHERE o.a<10 AND o.b<1000
;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------
Nested Loop (cost=0.57..872.57 rows=444 width=8) (actual time=0.610..4.871 rows=900 loops=1)
-> Index Only Scan using outer_table_a_b_id_idx on outer_table o (cost=0.29..80.72 rows=2221 width=8) (actual time=0.039..0.334 rows=2250 loops=1)
Index Cond: ((a < 10) AND (b < 1000))
Heap Fetches: 0
-> Index Only Scan using inner_table_id_b_idx on inner_table i (cost=0.29..0.36 rows=1 width=8) (actual time=0.002..0.002 rows=0 loops=2250)
Index Cond: (id = o.id)
Heap Fetches: 0
Planning Time: 4.991 ms
Execution Time: 5.670 ms
(9 rows)
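To make the tradeoff concrete, here is a toy Python sketch of the hash join strategy (my illustration, not PostgreSQL internals, and it assumes unique join keys): a one-time build pass over the outer input replaces the per-row index descents of the nested loop with constant-time probes.

```python
# Toy hash join (illustration only, assumes unique join keys):
# build once on the smaller input, then probe once per row of the other.
def hash_join(build_rows, probe_rows):
    # Build phase: one pass over the build input, keyed by id
    build = {row_id: b for row_id, b in build_rows}
    # Probe phase: one pass over the probe input; each lookup is O(1),
    # instead of one index descent per row as in a nested loop
    return [(build[row_id], b) for row_id, b in probe_rows if row_id in build]

outer_rows = [(1, 10), (2, 20), (3, 30)]     # (id, b) from outer_table
inner_rows = [(2, 200), (3, 300), (4, 400)]  # (id, b) from inner_table
print(hash_join(outer_rows, inner_rows))     # [(20, 200), (30, 300)]
```

The build cost is paid once, which is why the planner prefers it only when the outer row count is large enough.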
With accurate cardinality estimates, the query planner can select the optimal join method — but during execution, these estimates can sometimes be significantly inaccurate. This is where an adaptive plan can assist—not necessarily to find the perfect plan, but to prevent the worst-case scenarios.
I re-enable all join methods, activate the adaptive plan, and rerun my initial query, which retrieves 28 rows from the outer table.
set enable_hashjoin to on;
set enable_mergejoin to on;
set apg_adaptive_join_crossover_multiplier to 1;
set apg_enable_parameterized_adaptive_join to on;
explain (analyze)
SELECT o.b,i.b
FROM outer_table o
JOIN inner_table i USING(id)
WHERE o.a<10 AND o.b<10
;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------
Nested Loop (Adaptive) (cost=0.57..124.76 rows=4 width=8) (actual time=4.559..4.609 rows=9 loops=1)
Adaptive Crossover: rows=74
-> Index Only Scan using outer_table_a_b_id_idx on outer_table o (cost=0.29..58.71 rows=20 width=8) (actual time=1.999..3.261 rows=28 loops=1)
Index Cond: ((a < 10) AND (b < 10))
Heap Fetches: 0
-> Index Only Scan using inner_table_id_b_idx on inner_table i (cost=0.29..3.30 rows=1 width=8) (actual time=0.047..0.047 rows=0 loops=28)
Index Cond: (id = o.id)
Heap Fetches: 0
Planning Time: 2.107 ms
Execution Time: 4.648 ms
(10 rows)
It's still a nested loop with the same cost estimate as before, but it is now flagged as (Adaptive) and shows an extra detail: Adaptive Crossover: rows=74.
This indicates that the query planner found a nested loop to be cheaper than a hash join for the originally estimated number of iterations (rows=20). At planning time, it also computed the cost for higher row counts and identified a crossover point at rows=74, beyond which a hash join would have been cheaper and therefore selected. In other words, the planner pre-calculated an inflection point at which it would prefer a hash join and deferred the final choice to execution time.
At runtime, the rows read from outer_table are counted and buffered. Because the row count never reached the crossover point, the plan continued using the nested loop.
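Aurora does not document the mechanism, but the behavior described above can be modeled roughly as follows (a speculative sketch in Python, not Aurora's code): buffer outer rows while counting them, and switch to a hash join once the precomputed crossover is exceeded.

```python
import itertools

# Speculative sketch (not Aurora's actual implementation) of deferring the
# join method: buffer outer rows while counting them, and switch to a hash
# join once the precomputed crossover point is exceeded.
def adaptive_join(outer_iter, crossover, nested_loop, hash_join):
    buffered = []
    for row in outer_iter:
        buffered.append(row)
        if len(buffered) > crossover:
            # Too many rows for an efficient nested loop: the buffered rows
            # plus the rest of the scan feed the hash join's build side.
            return hash_join(itertools.chain(buffered, outer_iter))
    # The row count stayed at or below the crossover: keep the nested loop.
    return nested_loop(buffered)

# Hypothetical callbacks that just report which method ran and the row count
report = lambda name: lambda rows: (name, len(list(rows)))
print(adaptive_join(iter(range(28)), 74, report("nested loop"), report("hash join")))
# -> ('nested loop', 28)
print(adaptive_join(iter(range(100)), 74, report("nested loop"), report("hash join")))
# -> ('hash join', 100)
```

The key point of the design is that no work is redone: the buffered rows either drive the nested loop or become the hash join's build input.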
To see how the plan changes with more qualifying rows, I updated my data so that more rows satisfy the predicate a < 10 AND b < 10:
UPDATE outer_table SET b=0 WHERE a<10 AND b BETWEEN 10 AND 40;
UPDATE 47
I ran my query again. It’s still (Adaptive) with Adaptive Crossover: rows=74, but now it shows a Hash Join:
explain (analyze)
SELECT o.b,i.b
FROM outer_table o
JOIN inner_table i USING(id)
WHERE o.a<10 AND o.b<10
;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
Hash Join (Adaptive) (cost=58.96..240.21 rows=4 width=8) (actual time=2.531..3.801 rows=28 loops=1)
Output: o.b, i.b
Inner Unique: true
Adaptive Crossover: rows=74
Hash Cond: (i.id = o.id)
-> Seq Scan on public.inner_table i (cost=0.00..155.00 rows=10000 width=8) (actual time=0.007..0.628 rows=10000 loops=1)
Output: i.id, i.a, i.b
-> Hash (cost=58.71..58.71 rows=20 width=8) (actual time=2.470..2.470 rows=75 loops=1)
Output: o.b, o.id
-> Index Only Scan using outer_table_a_b_id_idx on public.outer_table o (cost=0.29..58.71 rows=20 width=8) (actual time=1.103..1.280 rows=75 loops=1)
Output: o.b, o.id
Index Cond: ((o.a < 10) AND (o.b < 10))
Heap Fetches: 57
Query Identifier: 8990309245261094611
Planning Time: 1.674 ms
Execution Time: 3.861 ms
(16 rows)
At planning time, the decision remained the same as before because the statistics had not changed (the estimate was still cost=0.29..58.71 rows=20). In reality, though, more than 74 rows were read from outer_table (actual rows=75), and instead of being used for a nested loop, the buffered rows were used as the build table of a hash join.
I then analyzed the table to see what would happen with fresh statistics, and was surprised to find the plan reverted to a nested loop:
analyze outer_table ;
explain (analyze)
SELECT o.b,i.b
FROM outer_table o
JOIN inner_table i USING(id)
WHERE o.a<10 AND o.b<10
;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------
Nested Loop (Adaptive) (cost=0.57..145.25 rows=5 width=8) (actual time=0.124..0.248 rows=28 loops=1)
Adaptive Crossover: rows=78
-> Index Only Scan using outer_table_a_b_id_idx on outer_table o (cost=0.29..70.29 rows=23 width=8) (actual time=0.026..0.104 rows=75 loops=1)
Index Cond: ((a < 10) AND (b < 10))
Heap Fetches: 57
-> Index Only Scan using inner_table_id_b_idx on inner_table i (cost=0.29..3.26 rows=1 width=8) (actual time=0.001..0.001 rows=0 loops=75)
Index Cond: (id = o.id)
Heap Fetches: 0
Planning Time: 1.025 ms
Execution Time: 0.287 ms
(10 rows)
The reason is that even with a freshly analyzed table, the optimizer's estimate is worse than before: it predicts fewer rows (rows=23) when there are actually more (rows=75). This can happen because a predicate such as a < 10 AND b < 10 combines conditions that are hard for the cost-based optimizer to estimate. Due to these misestimates, the inflection point was estimated higher (rows=78), so the optimizer still chose a nested loop plan.
What I like about this feature, which I have known since it was implemented in Oracle Database years ago, is that it doesn't try to find the absolute best plan. Instead, it focuses on avoiding the worst plans (for example, nested loops over tens or hundreds of thousands of rows) and switching to a plan that is simply good enough. Amazon Aurora is a black box with limited tracing, so it's difficult to know exactly how it works, but it probably behaves similarly to Oracle adaptive plans. I wrote an older blog post about how Oracle determines the inflection point.
As this requires more work at execution time, Aurora triggers it only when the cost of the first join method is higher than a threshold, by default 100:
postgres=> \dconfig apg*adaptive_join*
List of configuration parameters
Parameter | Value
----------------------------------------+-------
apg_adaptive_join_cost_threshold | 100
apg_adaptive_join_crossover_multiplier | 1
apg_enable_parameterized_adaptive_join | on
(3 rows)
In my examples, the costs were higher than apg_adaptive_join_cost_threshold (Nested Loop (cost=0.57..124.76)). I had to enable apg_enable_parameterized_adaptive_join because the join predicate is pushed down as a parameter (Index Cond: (id = o.id)), which is the main advantage of a nested loop join: it enables index access to the inner table. I set apg_adaptive_join_crossover_multiplier to 1 to enable the feature; a higher value raises the inflection point by multiplying the crossover value, which reduces the likelihood of the plan switching to a hash join.
To test it further, I modified my data so that the outer table returns 50,000 rows and ran the query again:
postgres=> UPDATE outer_table SET a=0 , b=0;
UPDATE 50000
postgres=> explain (analyze)
SELECT o.b,i.b
FROM outer_table o
JOIN inner_table i USING(id)
WHERE o.a<10 AND o.b<10
;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------
Hash Join (Adaptive) (cost=180.21..361.46 rows=5 width=8) (actual time=231.245..234.250 rows=10000 loops=1)
Adaptive Crossover: rows=139
Hash Cond: (i.id = o.id)
-> Seq Scan on inner_table i (cost=0.00..155.00 rows=10000 width=8) (actual time=0.014..0.633 rows=10000 loops=1)
-> Hash (cost=179.65..179.65 rows=45 width=8) (actual time=231.240..231.240 rows=50000 loops=1)
-> Index Only Scan using outer_table_a_b_id_idx on outer_table o (cost=0.42..179.65 rows=45 width=8) (actual time=1.641..90.167 rows=50000 loops=1)
Index Cond: ((a < 10) AND (b < 10))
Heap Fetches: 50075
Planning Time: 1.041 ms
Execution Time: 298.893 ms
(10 rows)
The adaptive plan avoided a nested loop join that would have required 50,000 iterations. With accurate statistics, the optimizer would probably have selected a merge join instead, since it needs to read both tables in full and there are indexes on the join keys; a merge join would have been faster than a hash join here. That said, even with stale statistics, the hash join was still much better, or at least less risky, than a nested loop join.
Without this feature in PostgreSQL, you still have options: ensure statistics are accurate (this remains true even with adaptive plans, which rely on estimates) and use extended statistics where they help. Make sure you have the right indexes so that cost differences are clear and the planner does not hesitate between two bad plans. You can use pg_hint_plan to force a specific join method, though it often needs more hints than expected (see Predictable plans with pg_hint_plan full hinting). Some people tweak random_page_cost, which affects index costs and thus join choices, but I have my own ideas about that. Because joins are central in SQL databases due to relational normalization, a poor join method can make them seem slow and unpredictable, so it is crucial to understand join strategies and review execution plans carefully. This feature in Aurora helps prevent some runaway queries, so I think it is a good idea to enable it by default, especially given that you can set a crossover multiplier to have it kick in only to avoid the worst cases.
December 23, 2025
Unnesting Scalar Subqueries into Left Outer Joins in SQL
Relational databases treat your query as a declarative description of the desired result and select the most efficient execution plan. They may rewrite the query—for example, by transforming subqueries into joins and vice versa—so the database, not the developer, manages this complexity.
Historically, PostgreSQL has had fewer planner transformations than many commercial databases. As an open-source project, it favors simpler code that promotes contributions and good SQL design, while commercial vendors can justify more complex planning logic when it helps revenue-generating customers reduce their optimization effort. PostgreSQL does not maintain a global shared plan cache, so most queries are planned per execution, which encourages keeping planning overhead low. The only exceptions are prepared statements, functions, and other situations where a plan can be generic, retained, and reused. Transformations open more access paths, potentially improving execution plans, but at the cost of higher planning time.
AWS recently added these kinds of transformations to Amazon Aurora PostgreSQL (APG) to optimize queries with a correlated subquery in the projection.
Without this transformation, projection expressions are evaluated per row, which at best yields nested-loop–like performance. By pushing the subquery down so it runs before the projection, more efficient join algorithms can be used. Amazon Aurora is a proprietary fork of PostgreSQL, and its improvements are not contributed upstream, so this feature is not available in PostgreSQL. Nonetheless, examining the transformation Aurora implements can inspire similar manual rewrites of queries. I’ll use the example from the AWS blog post to show how to rewrite the query more efficiently. It is important to understand the transformation carefully to ensure it does not change the results.
Here is the DDL and DML to create the same example as AWS's article:
-- Clean up from previous runs
DROP TABLE IF EXISTS outer_table;
DROP TABLE IF EXISTS inner_table;
-- Outer table (like customers)
CREATE TABLE outer_table (
id SERIAL PRIMARY KEY,
a INT,
b INT
);
-- Inner table (like orders)
CREATE TABLE inner_table (
id SERIAL PRIMARY KEY,
a INT,
b INT
);
-- Insert data into inner_table:
-- 10,000 rows, 'a' cycles from 1..100, 'b' random 0–999
INSERT INTO inner_table (a, b)
SELECT
1 + mod(gs - 1, 100),
floor(random() * 1000)::int
FROM generate_series(1, 10000) AS gs;
-- Insert outer_table:
-- First 25K rows: a = 1..100 repeated
INSERT INTO outer_table (a, b)
SELECT
1 + mod(gs - 1, 100),
floor(random() * 1000)::int
FROM generate_series(1, 25000) AS gs;
-- Next 25K rows: unique a = 101..35100
INSERT INTO outer_table (a, b)
SELECT
gs + 100,
floor(random() * 1000)::int
FROM generate_series(1, 25000) AS gs;
-- Gather statistics
ANALYZE outer_table;
ANALYZE inner_table;
I've set up an Aurora Serverless database to test it by enabling the transformation parameter and running the query, displaying the execution plan:
SET apg_enable_correlated_scalar_transform = on;
explain (analyze , verbose, costs off)
SELECT outer_table.a, outer_table.b
, (SELECT AVG(inner_table.b)
FROM inner_table
WHERE inner_table.a = outer_table.a
) FROM outer_table
;
QUERY PLAN
------------------------------------------------------------------------------------------------
Hash Left Join (actual time=4.904..15.740 rows=50000 loops=1)
Output: outer_table.a, outer_table.b, (avg(inner_table.b))
Inner Unique: true
Hash Cond: (outer_table.a = inner_table.a)
-> Seq Scan on public.outer_table (actual time=0.016..2.968 rows=50000 loops=1)
Output: outer_table.id, outer_table.a, outer_table.b
-> Hash (actual time=2.985..2.986 rows=100 loops=1)
Output: (avg(inner_table.b)), inner_table.a
Buckets: 1024 Batches: 1 Memory Usage: 13kB
-> HashAggregate (actual time=2.930..2.960 rows=100 loops=1)
Output: avg(inner_table.b), inner_table.a
Group Key: inner_table.a
Batches: 1 Memory Usage: 32kB
-> Seq Scan on public.inner_table (actual time=0.016..0.637 rows=10000 loops=1)
Output: inner_table.id, inner_table.a, inner_table.b
Query Identifier: -2382945993278526738
Planning Time: 2.439 ms
Execution Time: 23.322 ms
(18 rows)
I've added the verbose option to explain because it is important to see all elements of the rewritten query. Here, with the transformation, (SELECT AVG(inner_table.b) FROM ... WHERE ... ) has been transformed into (avg(inner_table.b)) over a Hash Left Join to the inner table pre-aggregated by the join key (HashAggregate).
It is the equivalent of:
explain (analyze , verbose, costs off )
SELECT outer_table.a, outer_table.b
, agg.avg_b
FROM outer_table
LEFT JOIN (
SELECT a, AVG(b) AS avg_b
FROM inner_table
GROUP BY a
) AS agg
ON outer_table.a = agg.a
;
QUERY PLAN
------------------------------------------------------------------------------------------------------
Hash Left Join (actual time=4.469..16.534 rows=50000 loops=1)
Output: outer_table.a, outer_table.b, agg.avg_b
Inner Unique: true
Hash Cond: (outer_table.a = agg.a)
-> Seq Scan on public.outer_table (actual time=0.011..3.124 rows=50000 loops=1)
Output: outer_table.id, outer_table.a, outer_table.b
-> Hash (actual time=3.804..3.806 rows=100 loops=1)
Output: agg.avg_b, agg.a
Buckets: 1024 Batches: 1 Memory Usage: 13kB
-> Subquery Scan on agg (actual time=3.733..3.778 rows=100 loops=1)
Output: agg.avg_b, agg.a
-> HashAggregate (actual time=3.732..3.765 rows=100 loops=1)
Output: inner_table.a, avg(inner_table.b)
Group Key: inner_table.a
Batches: 1 Memory Usage: 32kB
-> Seq Scan on public.inner_table (actual time=0.004..0.668 rows=10000 loops=1)
Output: inner_table.id, inner_table.a, inner_table.b
Query Identifier: -3523129028670016640
Planning Time: 1.361 ms
Execution Time: 19.674 ms
(20 rows)
This looks simple, but SQL can be subtle: not all aggregate functions have the same semantics in the absence of rows. If no inner rows match, AVG() over zero rows returns NULL, exactly what the correlated subquery would return. COUNT() is different: a scalar subquery with COUNT() returns 0 even when no inner rows match, whereas the outer join would produce NULL, so the rewrite needs an extra fix-up.
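The difference is easy to demonstrate outside the database with a plain Python analogy (my illustration of the aggregate semantics, not SQL itself):

```python
# Aggregates over an empty group, as a correlated scalar subquery computes
# them when no inner rows match the outer row:
def scalar_avg(values):
    return sum(values) / len(values) if values else None  # AVG() -> NULL

def scalar_count(values):
    return len(values)  # COUNT() -> 0, never NULL

no_match = []  # outer row with no matching inner rows
print(scalar_avg(no_match))    # None: same as the NULL from a left join
print(scalar_count(no_match))  # 0: but the left join yields NULL, hence COALESCE
```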
I tested the automatic transformation on Aurora with a COUNT():
explain (analyze , verbose, costs off)
SELECT outer_table.a, outer_table.b
, (SELECT COUNT(inner_table.b)
FROM inner_table
WHERE inner_table.a = outer_table.a
) FROM outer_table
;
QUERY PLAN
------------------------------------------------------------------------------------------------
Hash Left Join (actual time=2.319..13.332 rows=50000 loops=1)
Output: outer_table.a, outer_table.b, COALESCE((count(inner_table.b)), 0)
Inner Unique: true
Hash Cond: (outer_table.a = inner_table.a)
-> Seq Scan on public.outer_table (actual time=0.012..3.003 rows=50000 loops=1)
Output: outer_table.id, outer_table.a, outer_table.b
-> Hash (actual time=2.302..2.304 rows=100 loops=1)
Output: (count(inner_table.b)), inner_table.a
Buckets: 1024 Batches: 1 Memory Usage: 13kB
-> HashAggregate (actual time=2.255..2.268 rows=100 loops=1)
Output: count(inner_table.b), inner_table.a
Group Key: inner_table.a
Batches: 1 Memory Usage: 24kB
-> Seq Scan on public.inner_table (actual time=0.003..0.640 rows=10000 loops=1)
Output: inner_table.id, inner_table.a, inner_table.b
Query Identifier: 6903753335662751945
Planning Time: 1.267 ms
Execution Time: 15.219 ms
(18 rows)
Now the VERBOSE option shows COALESCE((count(inner_table.b)), 0), which turns a NULL into zero to match the COUNT() semantics.
Here is the equivalent query if you want to do the transformation manually:
explain (analyze , verbose )
SELECT outer_table.a, outer_table.b
, COALESCE(agg.cnt_b, 0) AS cnt_b
FROM outer_table
LEFT JOIN (
SELECT a, COUNT(b) AS cnt_b
FROM inner_table
GROUP BY a
) AS agg
ON outer_table.a = agg.a;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------
Hash Left Join (cost=208.25..1110.53 rows=50000 width=16) (actual time=2.556..77.785 rows=50000 loops=1)
Output: outer_table.a, outer_table.b, COALESCE(agg.cnt_b, '0'::bigint)
Inner Unique: true
Hash Cond: (outer_table.a = agg.a)
-> Seq Scan on public.outer_table (cost=0.00..771.00 rows=50000 width=8) (actual time=0.012..3.183 rows=50000 loops=1)
Output: outer_table.id, outer_table.a, outer_table.b
-> Hash (cost=207.00..207.00 rows=100 width=12) (actual time=2.532..2.534 rows=100 loops=1)
Output: agg.cnt_b, agg.a
Buckets: 1024 Batches: 1 Memory Usage: 13kB
-> Subquery Scan on agg (cost=205.00..207.00 rows=100 width=12) (actual time=2.485..2.509 rows=100 loops=1)
Output: agg.cnt_b, agg.a
-> HashAggregate (cost=205.00..206.00 rows=100 width=12) (actual time=2.484..2.497 rows=100 loops=1)
Output: inner_table.a, count(inner_table.b)
Group Key: inner_table.a
Batches: 1 Memory Usage: 24kB
-> Seq Scan on public.inner_table (cost=0.00..155.00 rows=10000 width=8) (actual time=0.006..0.679 rows=10000 loops=1)
Output: inner_table.id, inner_table.a, inner_table.b
Query Identifier: 4982770911819576582
Planning Time: 0.151 ms
Execution Time: 80.622 ms
(20 rows)
You need COALESCE with COUNT in the manual join rewrite because of how SQL aggregates behave: unlike other aggregates, COUNT returns 0—not NULL—when there are no rows.
Aurora PostgreSQL’s apg_enable_correlated_scalar_transform shows how the planner can rewrite correlated subqueries in the SELECT list into join-based aggregates, replacing per-row subqueries with set-based aggregation and hash joins for much better performance.
Even if you don’t use Aurora in production, you can spin up a small Aurora Serverless instance to validate your manual rewrites. Enable the parameter and compare the execution plan to your PostgreSQL version. The plans should match, with one structural difference: the manual version’s grouped subquery is wrapped in a Subquery Scan, which is expected in hand-written SQL.
For aggregates like AVG(), the rewrite preserves semantics with no extra changes. For COUNT(), wrap the join output in COALESCE(..., 0) so it behaves like the original correlated subquery when no rows are found.
By understanding these transformations and their semantics, you can reproduce Aurora’s optimization in upstream PostgreSQL and gain similar performance benefits while keeping full control over correctness.
Kubernetes Multi-Cloud Architecture: Building Portable Databases Without Lock-In
December 22, 2025
Memory Management in MongoDB 8.0: Testing the New TCMalloc
December 21, 2025
JSONB vs. BSON: Tracing PostgreSQL and MongoDB Wire Protocols
There is an essential difference between MongoDB’s BSON and PostgreSQL’s JSONB. Both are binary JSON formats, but they serve different roles. JSONB is purely an internal storage format for JSON data in PostgreSQL. BSON, on the other hand, is MongoDB’s native data format: it is used by application drivers, over the network, in memory, and on disk.
JSONB: PostgreSQL internal storage format
JSONB is a storage format, as defined by the PostgreSQL documentation:
PostgreSQL offers two types for storing JSON data: json and jsonb
PostgreSQL uses JSONB solely for internal storage, requiring the entire structure to be read to access a field, as observed in JSONB DeTOASTing (read amplification).
BSON: MongoDB storage and exchange format
BSON is used for storage and also as an exchange format between the application and the database, as defined in the BSON specification:
BSON can be compared to binary interchange formats, like Protocol Buffers. BSON is more "schema-less" than Protocol Buffers
On the application side, the MongoDB driver converts application objects to BSON, which supports more data types than JSON or JSONB, including datetime and binary. This BSON is sent and received over the network and stored and manipulated on the server as-is, with no extra parsing. Both the driver and the database can efficiently access fields via the binary structure because BSON includes metadata such as field length prefixes and explicit type information, even for large or nested documents.
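To see that structure concretely, here is a minimal hand-rolled encoder for documents whose values are all int32 (a simplified sketch following the BSON spec's layout, not the official library):

```python
import struct

# BSON document layout: little-endian int32 total length, a sequence of
# elements, then a 0x00 terminator. Each element is a type byte
# (0x10 = int32), a NUL-terminated field name, and the value.
def bson_encode_int32_doc(doc):
    elements = b''.join(
        b'\x10' + name.encode() + b'\x00' + struct.pack('<i', value)
        for name, value in doc.items()
    )
    body = elements + b'\x00'
    return struct.pack('<i', 4 + len(body)) + body

encoded = bson_encode_int32_doc({"a": 1})
print(encoded.hex())  # 0c0000001061000100000000

# The length prefix lets a reader skip a whole (possibly nested) document
# without parsing it: the first 4 bytes give its total size.
(total_len,) = struct.unpack_from('<i', encoded)
print(total_len)  # 12
```

This length-prefixed layout is what allows a reader to jump over fields and sub-documents without scanning every byte, unlike a text JSON payload.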
PostgreSQL protocol is JSON (text), not JSONB
To illustrate this, I've written a small Python program that inserts a document into a PostgreSQL table with a JSONB column, and queries that table to retrieve the document:
from sqlalchemy import Column, Integer, create_engine
from sqlalchemy.dialects.postgresql import JSONB
from sqlalchemy.orm import declarative_base, sessionmaker
Base = declarative_base()
class Item(Base):
__tablename__ = 'items'
id = Column(Integer, primary_key=True)
data = Column(JSONB) # our JSONB column
# Connect to Postgres (connection settings come from the PG* environment variables)
engine = create_engine('postgresql+psycopg2://', echo=True)
Session = sessionmaker(bind=engine)
session = Session()
# Create table
Base.metadata.create_all(engine)
# Insert an object into JSONB column
obj = {"name": "widget", "price": 9.99, "tags": ["new", "sale"]}
session.add(Item(data=obj))
session.commit()
# Read back the table
for row in session.query(Item).all():
print(row.id, row.data)
The program uses SQLAlchemy to send and retrieve Python objects to and from PostgreSQL via the Psycopg2 driver. I've stored it in demo.py.
When I run the program, with python demo.py, before it displays the final result, it logs all SQL statements:
2025-12-21 12:50:22,484 INFO sqlalchemy.engine.Engine select pg_catalog.version()
2025-12-21 12:50:22,485 INFO sqlalchemy.engine.Engine [raw sql] {}
2025-12-21 12:50:22,486 INFO sqlalchemy.engine.Engine select current_schema()
2025-12-21 12:50:22,486 INFO sqlalchemy.engine.Engine [raw sql] {}
2025-12-21 12:50:22,486 INFO sqlalchemy.engine.Engine show standard_conforming_strings
2025-12-21 12:50:22,486 INFO sqlalchemy.engine.Engine [raw sql] {}
2025-12-21 12:50:22,487 INFO sqlalchemy.engine.Engine BEGIN (implicit)
2025-12-21 12:50:22,488 INFO sqlalchemy.engine.Engine select relname from pg_class c join pg_namespace n on n.oid=c.relnamespace where pg_catalog.pg_table_is_visible(c.oid) and relname=%(name)s
2025-12-21 12:50:22,488 INFO sqlalchemy.engine.Engine [generated in 0.00015s] {'name': 'items'}
2025-12-21 12:50:22,489 INFO sqlalchemy.engine.Engine
CREATE TABLE items (
id SERIAL NOT NULL,
data JSONB,
PRIMARY KEY (id)
)
2025-12-21 12:50:22,489 INFO sqlalchemy.engine.Engine [no key 0.00011s] {}
2025-12-21 12:50:22,491 INFO sqlalchemy.engine.Engine COMMIT
2025-12-21 12:50:22,493 INFO sqlalchemy.engine.Engine BEGIN (implicit)
2025-12-21 12:50:22,494 INFO sqlalchemy.engine.Engine INSERT INTO items (data) VALUES (%(data)s) RETURNING items.id
2025-12-21 12:50:22,494 INFO sqlalchemy.engine.Engine [generated in 0.00018s] {'data': '{"name": "widget", "price": 9.99, "tags": ["new", "sale"]}'}
2025-12-21 12:50:22,495 INFO sqlalchemy.engine.Engine COMMIT
2025-12-21 12:50:22,497 INFO sqlalchemy.engine.Engine BEGIN (implicit)
2025-12-21 12:50:22,498 INFO sqlalchemy.engine.Engine SELECT items.id AS items_id, items.data AS items_data
FROM items
2025-12-21 12:50:22,498 INFO sqlalchemy.engine.Engine [generated in 0.00013s] {}
1 {'name': 'widget', 'tags': ['new', 'sale'], 'price': 9.99}
To see what is sent and received over the network by the PostgreSQL protocol, I run the program under strace, showing the sendto and recvfrom system calls with their arguments: strace -e trace=sendto,recvfrom -yy -s 1000 python demo.py.
Like most SQL database drivers, the protocol is basic: send SQL commands as text and fetch a tabular result set. In the PostgreSQL protocol, each message starts with a one-byte type followed by the message length: Q is a Simple Query (the SQL text follows), X terminates the session, C is a command completion status, and T and D carry the result set (row description and data rows).
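That framing is simple enough to reproduce by hand (a sketch of the Simple Query message based on the frontend/backend protocol: the int32 length is big-endian and counts itself but not the type byte):

```python
import struct

# Build a PostgreSQL Simple Query ('Q') message: one type byte, then a
# big-endian int32 length (including itself, excluding the type byte),
# then the SQL text as a NUL-terminated string.
def simple_query(sql):
    payload = sql.encode() + b'\x00'
    return b'Q' + struct.pack('>i', 4 + len(payload)) + payload

print(simple_query("BEGIN"))  # b'Q\x00\x00\x00\nBEGIN\x00'
```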
Here is the output: lines starting with a timestamp are the SQLAlchemy logs, and lines starting with sendto and recvfrom are the network system calls, showing the message sent to the database and the result received from it.
Here is the trace when inserting one document:
2025-12-21 16:52:20,278 INFO sqlalchemy.engine.Engine BEGIN (implicit)
2025-12-21 16:52:20,279 INFO sqlalchemy.engine.Engine INSERT INTO items (data) VALUES (%(data)s) RETURNING items.id
2025-12-21 16:52:20,279 INFO sqlalchemy.engine.Engine [generated in 0.00029s] {'data': '{"name": "widget", "price": 9.99, "tags": ["new", "sale"]}'}
sendto(3<TCPv6:[[::1]:41858->[::1]:5432]>, "Q\0\0\0\nBEGIN\0", 11, MSG_NOSIGNAL, NULL, 0) = 11
recvfrom(3<TCPv6:[[::1]:41858->[::1]:5432]>, "C\0\0\0\nBEGIN\0Z\0\0\0\5T", 16384, 0, NULL, NULL) = 17
sendto(3<TCPv6:[[::1]:41858->[::1]:5432]>, "Q\0\0\0vINSERT INTO items (data) VALUES ('{\"name\": \"widget\", \"price\": 9.99, \"tags\": [\"new\", \"sale\"]}') RETURNING items.id\0", 119, MSG_NOSIGNAL, NULL, 0) = 119
recvfrom(3<TCPv6:[[::1]:41858->[::1]:5432]>, "T\0\0\0\33\0\1id\0\0\0@\310\0\1\0\0\0\27\0\4\377\377\377\377\0\0D\0\0\0\v\0\1\0\0\0\0011C\0\0\0\17INSERT 0 1\0Z\0\0\0\5T", 16384, 0, NULL, NULL) = 62
2025-12-21 16:52:20,281 INFO sqlalchemy.engine.Engine COMMIT
sendto(3<TCPv6:[[::1]:41858->[::1]:5432]>, "Q\0\0\0\vCOMMIT\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
recvfrom(3<TCPv6:[[::1]:41858->[::1]:5432]>, "C\0\0\0\vCOMMIT\0Z\0\0\0\5I", 16384, 0, NULL, NULL) = 18
It started a transaction (Q\0\0\0\nBEGIN), received command completion (C\0\0\0\nBEGIN), then sent the full text of the INSERT command, including the JSON payload (Q\0\0\0vINSERT INTO items (data) VALUES ('{\"name\": \"widget\", \"price\": 9.99, \"tags\": [\"new\", \"sale\"]}). It subsequently received command completion (INSERT 0 1) and the returned ID (T\0\0\0\33\0\1id, D\0\0\0\v\0\1\0\0\0\001).
Here is the trace when I query and fetch the document:
2025-12-21 16:52:20,283 INFO sqlalchemy.engine.Engine BEGIN (implicit)
2025-12-21 16:52:20,285 INFO sqlalchemy.engine.Engine SELECT items.id AS items_id, items.data AS items_data
FROM items
2025-12-21 16:52:20,285 INFO sqlalchemy.engine.Engine [generated in 0.00024s] {}
sendto(3<TCPv6:[[::1]:41858->[::1]:5432]>, "Q\0\0\0\nBEGIN\0", 11, MSG_NOSIGNAL, NULL, 0) = 11
recvfrom(3<TCPv6:[[::1]:41858->[::1]:5432]>, "C\0\0\0\nBEGIN\0Z\0\0\0\5T", 16384, 0, NULL, NULL) = 17
sendto(3<TCPv6:[[::1]:41858->[::1]:5432]>, "Q\0\0\0FSELECT items.id AS items_id, items.data AS items_data \nFROM items\0", 71, MSG_NOSIGNAL, NULL, 0) = 71
recvfrom(3<TCPv6:[[::1]:41858->[::1]:5432]>, "T\0\0\0>\0\2items_id\0\0\0@\310\0\1\0\0\0\27\0\4\377\377\377\377\0\0items_data\0\0\0@\310\0\2\0\0\16\332\377\377\377\377\377\377\0\0D\0\0\0I\0\2\0\0\0\0011\0\0\0:{\"name\": \"widget\", \"tags\": [\"new\", \"sale\"], \"price\": 9.99}C\0\0\0\rSELECT 1\0Z\0\0\0\5T", 16384, 0, NULL, NULL) = 157
It started another transaction, sent the SELECT statement as text and received the result as JSON text (D\0\0\0I\0\2\0\0\0\0011\0\0\0:{\"name\": \"widget\", \"tags\": [\"new\", \"sale\"], \"price\": 9.99}).
Finally, the transaction ends, and the session is disconnected:
sendto(3<TCPv6:[[::1]:41858->[::1]:5432]>, "Q\0\0\0\rROLLBACK\0", 14, MSG_NOSIGNAL, NULL, 0) = 14
recvfrom(3<TCPv6:[[::1]:41858->[::1]:5432]>, "C\0\0\0\rROLLBACK\0Z\0\0\0\5I", 16384, 0, NULL, NULL) = 20
sendto(3<TCPv6:[[::1]:41858->[::1]:5432]>, "X\0\0\0\4", 5, MSG_NOSIGNAL, NULL, 0) = 5
If you want to dig into the code, the server-side handling is in jsonb_send and jsonb_recv ("The type is sent as text in binary mode"); although the code checks a version byte before converting to text, there is currently only one version. On the client side, Psycopg2's register_default_jsonb is the same as register_default_json.
Comparing with MongoDB
To compare with MongoDB, I created the following demo-mongodb.py:
from pymongo import MongoClient
client = MongoClient("mongodb://127.0.0.1:27017")
db = client.my_database
insert_result = db.items.insert_one({"name": "widget", "price": 9.99, "tags": ["new", "sale"]})
print("Inserted document ID:", insert_result.inserted_id)
for doc in db.items.find():
    print(doc["_id"], doc)
I used the same strace command, but displaying all characters as hexadecimal to be able to decode them with bsondump:
$ strace -e trace=sendto,recvfrom -xx -yy -s 1000 python demo-mongodb.py 2>&1
Here is the network request for the insert:
sendto(5<TCP:[127.0.0.1:44570->127.0.0.1:27017]>, "\xd6\x00\x00\x00\x51\xdc\xb0\x74\x00\x00\x00\x00\xdd\x07\x00\x00\x00\x00\x00\x00\x00\x5a\x00\x00\x00\x02\x69\x6e\x73\x65\x72\x74\x00\x06\x00\x00\x00\x69\x74\x65\x6d\x73\x00\x08\x6f\x72\x64\x65\x72\x65\x64\x00\x01\x03\x6c\x73\x69\x64\x00\x1e\x00\x00\x00\x05\x69\x64\x00\x10\x00\x00\x00\x04\x31\xb8\x9a\x81\xfd\x35\x42\x1a\x88\x44\xa8\x69\xe8\xba\x6f\x30\x00\x02\x24\x64\x62\x00\x0c\x00\x00\x00\x6d\x79\x5f\x64\x61\x74\x61\x62\x61\x73\x65\x00\x00\x01\x66\x00\x00\x00\x64\x6f\x63\x75\x6d\x65\x6e\x74\x73\x00\x58\x00\x00\x00\x07\x5f\x69\x64\x00\x69\x48\x3f\x7f\x87\x46\xd5\x2e\xe2\x0b\xbc\x0b\x02\x6e\x61\x6d\x65\x00\x07\x00\x00\x00\x77\x69\x64\x67\x65\x74\x00\x01\x70\x72\x69\x63\x65\x00\x7b\x14\xae\x47\xe1\xfa\x23\x40\x04\x74\x61\x67\x73\x00\x1c\x00\x00\x00\x02\x30\x00\x04\x00\x00\x00\x6e\x65\x77\x00\x02\x31\x00\x05\x00\x00\x00\x73\x61\x6c\x65\x00\x00\x00", 214, 0, NULL, 0) = 214
recvfrom(5<TCP:[127.0.0.1:44570->127.0.0.1:27017]>, "\x2d\x00\x00\x00\x06\x00\x00\x00\x51\xdc\xb0\x74\xdd\x07\x00\x00", 16, 0, NULL, NULL)<... (truncated)
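Every MongoDB wire message starts with a fixed 16-byte header of four little-endian int32 fields: messageLength, requestID, responseTo, and opCode. A small sketch decoding the header of the insert request above:

```python
import struct

# First 16 bytes of the sendto() payload shown above
header = bytes.fromhex("d600000051dcb07400000000dd070000")

# MongoDB wire-protocol header: four little-endian int32 fields
length, request_id, response_to, op_code = struct.unpack("<iiii", header)

print(length)   # 214 -- matches the byte count reported by sendto()
print(op_code)  # 2013 -- OP_MSG
```

The rest of the payload is a BSON document (the insert command with its lsid, $db, and documents fields), which is what bsondump can decode from the hex.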
December 20, 2025
IO-bound sysbench vs MySQL on a 48-core server
This has results for an IO-bound sysbench benchmark on a 48-core server for MySQL versions 5.6 through 9.5. Results from a CPU-bound sysbench benchmark on the 48-core server are here.
tl;dr
- the regressions here on read-only tests are smaller than on the CPU-bound workload, but when they occur they are from new CPU overheads
- the large improvements here on write-heavy tests are similar to the CPU-bound workload
The read-heavy microbenchmarks are run for 600 seconds and the write-heavy for 900 seconds. The benchmark is run with 40 clients and 8 tables with 250M rows per table. With 250M rows per table this is IO-bound. I normally use 10M rows per table for CPU-bound workloads.
I provide charts below with relative QPS. The relative QPS is the following:
(QPS for some version) / (QPS for MySQL 5.6.51)
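For illustration, with hypothetical QPS numbers (not from the actual benchmark results), a version doing 90,000 QPS against a 5.6.51 baseline of 60,000 QPS has a relative QPS of 1.5:

```python
# Hypothetical QPS values, for illustration only
qps_base = 60_000  # MySQL 5.6.51 baseline
qps_new = 90_000   # some newer version

relative_qps = qps_new / qps_base
print(relative_qps)  # 1.5 -- above 1.0 means faster than 5.6.51
```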
- the improvement for hot-points is similar to the CPU-bound results
- the regressions here for the IO-bound tests are smaller than for the CPU-bound results
- the regression in point-query is from new CPU overhead, see cpu/o here which is 1.37X larger in 9.5.0 vs 5.6.51
- the regression in points-covered-si is from new CPU overhead, see cpu/o here which is 1.24X larger in 9.5.0 vs 5.6.51. This test is CPU-bound; the queries don't do IO because the secondary indexes are cached.
- the regressions here for the IO-bound tests are smaller than for the CPU-bound results, except for the scan test
- the regressions in scan are from new CPU overhead, see cpu/o here, which is 1.38X larger in 9.5.0 vs 5.6.51
- the regressions here for the IO-bound tests are smaller than for the CPU-bound results
- the regressions in read-only-count are from new CPU overhead, see cpu/o here, which is 1.25X larger in 9.5.0 vs 5.6.51
- the improvements here for the IO-bound tests are similar to the CPU-bound results
- the largest improvement, for the update-index test, is from using less CPU, fewer context switches, less read IO and less write IO per operation -- see cpu/o, cs/o, rKB/o and wKB/o here
December 19, 2025
Atlas Search score details (the BM25 calculation)
With @james_blackwoodsewell_58, we were comparing the BM25 text search scores between MongoDB Atlas (Lucene), Elasticsearch (Lucene), and ParadeDB (using Tantivy). All three produce the same ordering, but MongoDB Atlas consistently shows a score lower by a factor of 2.2:
It was the occasion for me to look at the score details, which expose the full calculation behind the score.
Test case
I've built the same test case as in my previous blog:
db.articles.drop();
db.articles.deleteMany({});
db.articles.insertMany([
{ description : "🍏 🍌 🍊" }, // short, 1 🍏
{ description : "🍎 🍌 🍊" }, // short, 1 🍎
{ description : "🍎 🍌 🍊 🍎" }, // larger, 2 🍎
{ description : "🍎 🍌 🍊 🍊 🍊" }, // larger, 1 🍎
{ description : "🍎 🍌 🍊 🌴 🫐 🍈 🍇 🌰" }, // large, 1 🍎
{ description : "🍎 🍎 🍎 🍎 🍎 🍎" }, // large, 6 🍎
{ description : "🍎 🍌" }, // very short, 1 🍎
{ description : "🍌 🍊 🌴 🫐 🍈 🍇 🌰 🍎" }, // large, 1 🍎
{ description : "🍎 🍎 🍌 🍌 🍌" }, // shorter, 2 🍎
]);
db.articles.createSearchIndex("default",
{ mappings: { dynamic: true } }
);
Score with details
I ran the same query, adding scoreDetails: true to the search stage, and scoreDetails: { $meta: "searchScoreDetails" } to the projection stage:
db.articles.aggregate([
{
$search: {
text: { query: ["🍎", "🍏"], path: "description" },
index: "default",
scoreDetails: true
}
},
{ $project: {
_id: 0, description: 1,
score: { $meta: "searchScore" },
scoreDetails: { $meta: "searchScoreDetails" } } },
{ $sort: { score: -1 } } ,
{ $limit: 1 }
])
Here is the result:
mdb> db.articles.aggregate([
... {
... $search: {
... text: { query: ["🍎", "🍏"], path: "description" },
... index: "default",
... scoreDetails: true
... }
... },
... { $project: { _id: 0, description: 1, score: { $meta: "searchScore" }, scoreDetails: { $meta: "searchScoreDetails" } } },
... { $sort: { score: -1 } } ,
... { $limit: 1 }
... ])
[
{
description: '🍏 🍌 🍊',
score: 1.0242118835449219,
scoreDetails: {
value: 1.0242118835449219,
description: 'sum of:',
details: [
{
value: 1.0242118835449219,
description: '$type:string/description:🍏 [BM25Similarity], result of:',
details: [
{
value: 1.0242118835449219,
description: 'score(freq=1.0), computed as boost * idf * tf from:',
details: [
{
value: 1.8971199989318848,
description: 'idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:',
details: [
{
value: 1,
description: 'n, number of documents containing term',
details: []
},
{
value: 9,
description: 'N, total number of documents with field',
details: []
}
]
},
{
value: 0.5398772954940796,
description: 'tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:',
details: [
{
value: 1,
description: 'freq, occurrences of term within document',
details: []
},
{
value: 1.2000000476837158,
description: 'k1, term saturation parameter',
details: []
},
{
value: 0.75,
description: 'b, length normalization parameter',
details: []
},
{
value: 3,
description: 'dl, length of field',
details: []
},
{
value: 4.888888835906982,
description: 'avgdl, average length of field',
details: []
}
]
}
]
}
]
}
]
}
}
]
All the details are there. Here is the scoring breakdown for "🍏 🍌 🍊", which produced a score of 1.0242118835449219.
IDF calculation (inverse document frequency)
From the score details:
- Number of documents containing the term: n = 1
- Total number of documents with this field: N = 9
idf = log(1 + (N - n + 0.5) / (n + 0.5))
    = log(1 + (9 - 1 + 0.5) / (1 + 0.5))
    = log(6.666666666666667)
    ≈ 1.8971199989318848
TF calculation (term frequency)
Parameters are the Lucene defaults:
- Term saturation parameter: k1 = 1.2000000476837158
- Length normalization parameter: b = 0.75
Document field statistics:
- Occurrences of the term in this document: freq = 1
- Length of this document's field: dl = 3
- Average length of the field: avgdl = 44 / 9 ≈ 4.888888835906982
tf = freq / (freq + k1 * (1 - b + b * dl / avgdl))
   = 1 / (1 + 1.2000000476837158 × (0.25 + 0.75 × (3 / 4.888888835906982)))
   ≈ 0.5398772954940796
Final score
Parameter:
- Boost: boost = 1.0
score = boost × idf × tf
      = 1.0 × 1.8971199989318848 × 0.5398772954940796
      ≈ 1.0242118835449219
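The whole breakdown can be checked in a few lines of Python (standard library only): plugging the values from the score details into the two formulas reproduces the reported score.

```python
import math

def bm25_score(freq, n, N, dl, avgdl, k1=1.2, b=0.75, boost=1.0):
    # idf and tf exactly as shown in the Atlas Search score details
    idf = math.log(1 + (N - n + 0.5) / (n + 0.5))
    tf = freq / (freq + k1 * (1 - b + b * dl / avgdl))
    return boost * idf * tf

# Values from the score details of "🍏 🍌 🍊"
score = bm25_score(freq=1, n=1, N=9, dl=3, avgdl=44 / 9)
print(score)  # ≈ 1.0242118835449219
```

(The score details report k1 as 1.2000000476837158 because Lucene stores it as a 32-bit float; using 1.2 changes the result only far beyond the displayed precision.)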
That confirms that Atlas Search uses the same scoring as Lucene: https://github.com/apache/lucene/blob/releases/lucene/10.3.2/lucene/core/src/java/org/apache/lucene/search/similarities/BM25Similarity.java#L183
What about Elasticsearch and Tantivy?
Eight years ago, Lucene removed the (k1 + 1) factor in LUCENE-8563. For k1 = 1.2, this change reduces the score by a factor of 2.2 from that version onward. Tantivy and Elasticsearch apparently still use the old formula, while Atlas Search uses the updated one, which explains the observed differences in scoring.
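The factor is easy to verify: the old formula multiplied the numerator of tf by (k1 + 1), so for identical parameters the old score is exactly (k1 + 1) times the new one.

```python
k1, b = 1.2, 0.75
freq, dl, avgdl = 1, 3, 44 / 9

denom = freq + k1 * (1 - b + b * dl / avgdl)
tf_new = freq / denom             # Lucene after LUCENE-8563 (Atlas Search)
tf_old = freq * (k1 + 1) / denom  # Lucene before the change

print(tf_old / tf_new)  # ≈ 2.2 -- the observed scoring factor
```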
Conclusion
MongoDB Atlas Search indexes use the same scoring mechanism as Lucene indexes. When you compare them with other text search engines that also use Lucene, you might notice a score difference by a factor of about 2.2. This doesn’t actually matter, because scores are used only for ordering results, and the relative ordering remains the same.
Text search scores can seem magical, but they are fully deterministic and based on open-source formulas. In MongoDB, you can include the score details option in a text search query to inspect all the parameters and formulas behind the score.
LeaseGuard: Raft Leases Done Right!
Many distributed systems have a leader-based consensus protocol at their heart. The protocol elects one server as the "leader" who receives all writes. The other servers are "followers", hot standbys who replicate the leader’s data changes. Paxos and Raft are the most famous leader-based consensus protocols.
These protocols ensure consistent state machine replication, but reads are still tricky. Imagine a new leader L1 is elected, while the previous leader L0 thinks it's still in charge. A client might write to L1, then read stale data from L0, violating Read Your Writes. How can we prevent stale reads? The original Raft paper recommended that the leader communicate with a majority of followers before each read, to confirm it's the real leader. This guarantees Read Your Writes but it's slow and expensive.
A leader lease is an agreement among a majority of servers that one server will be the only leader for a certain time. This means the leader can run queries without communicating with the followers, and still ensure Read Your Writes. The original description of Raft included a lease protocol that was inherited from the earlier Paxos, where followers refuse to vote for a new leader until the old leader's lease expires. This entangles leases and elections, and it delays recovery after a crash. Besides, lease protocols have never been specified in detail, for either Raft or Paxos. For all these reasons, many Raft implementations don't use leases at all, or their leases are buggy.
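To make the idea concrete, here is a hypothetical sketch (not LeaseGuard itself, just the classic leader-lease check) of how a leader decides whether it may serve a read locally; the class, method names, and lease duration are invented for illustration:

```python
import time

LEASE_DURATION = 5.0  # seconds; an invented value for illustration

class Leader:
    def __init__(self):
        self.lease_expiry = 0.0

    def on_majority_heartbeat_ack(self, sent_at: float) -> None:
        # A majority acknowledged a heartbeat sent at `sent_at`, so no
        # rival leader can take over before sent_at + LEASE_DURATION.
        self.lease_expiry = max(self.lease_expiry, sent_at + LEASE_DURATION)

    def can_serve_local_read(self) -> bool:
        # Safe to read locally only while the lease is held; otherwise
        # fall back to the quorum round from the original Raft paper.
        return time.monotonic() < self.lease_expiry

leader = Leader()
leader.on_majority_heartbeat_ack(time.monotonic())
print(leader.can_serve_local_read())  # True while the lease is held
```

A real implementation must also bound clock drift between servers; that is part of what a careful lease protocol has to specify.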
In the MongoDB Distributed Systems Research Group, we designed a simple lease protocol tailored for Raft, called LeaseGuard. Our main innovation is to rely on Raft-specific guarantees to design a simpler lease protocol that recovers faster from a leader crash. Here’s a preprint of our SIGMOD'26 paper. This is a joint blog post by A. Jesse Jiryu Davis and Murat Demirbas, published on both of our blogs.
A huge simplification: the log is the lease
LeaseGuard makes leader failovers smoother and faster
Transitions in the read/write availability of leaders with LeaseGuard. While the new leader waits for a lease, it can serve some consistent reads and accept writes.
Tests and benchmarks
- Inconsistent: LogCabin running fast and loose, with no guarantee of Read Your Writes.
- Quorum: The default Read Your Writes mechanism, where the leader talks to a majority of followers before running each query, is miserably slow—notice that its Y axis is one tenth as high as the other charts!
- Lease: The “log is the lease” protocol with no optimizations. Its throughput is as high as “inconsistent”, but it has a long time to recovery after the old leader crashes.
- Defer commit: The log is the lease, plus our write optimization—you can see that write throughput spikes off the chart at 1500 ms, because the leader has been processing writes while waiting for the lease. As soon as it gets the lease, it commits all the writes at once.
- Inherit lease: LeaseGuard with all our optimizations. Read throughput recovers as soon as a new leader is elected, without waiting for the old lease to expire.