June 26, 2025
Scaling Smarter: What You Have Missed in MongoDB 8.0
June 25, 2025
Building a job search engine with PostgreSQL’s advanced search features
Using Percona Everest Operator CRDs to Manage Databases in Kubernetes
Build a Personalized AI Assistant with Postgres
Use CedarDB to search the CedarDB docs and blogs
Motivation
Not so long ago, I shared that I have an interest in finding things, and in that case the question was about where something could be found. Another common requirement is, given some expression of an interest, finding the set of documents that best answers the question. For example, coupled with the geospatial question, we might add that we're looking for Indian restaurants within the specified geographic area.
For this article, though, we’ll restrict the focus to the problem of finding the most relevant documents within some collection, where that collection just happens to be the CedarDB documentation. To that end, I’ll assert up front that my query “Does the CedarDB ‘asof join’ use an index?” should return a helpful response, while the query “Does pickled watermelon belong on a taco?” should ideally return an empty result.
June 24, 2025
No pre-filtering in pgvector means reduced ANN recall
AI applications are expanding rapidly, and PostgreSQL is a popular choice among relational databases. The pgvector extension, a third-party add-on, enhances PostgreSQL by introducing a high-dimensional vector data type with similarity operations and search indexing.
Integrating embeddings directly into a general-purpose database eliminates the need for a separate vector database. Typically, approximate searches on embeddings are performed alongside exact searches on other attributes (SQL columns or document fields), such as metadata, dates, or other dimensions.
PostgreSQL offers various index types, but it has notable limitations when combining them, as we have seen in PostgreSQL JSONB Indexing Limitations with B-Tree and GIN. Likewise, pgvector encounters similar issues.
Some users have moved to MongoDB Atlas Vector Search because it offers pre-filtering capabilities. They had incomplete results with PostgreSQL pgvector when filtering with other predicates. To better understand the impact of lacking pre-filtering in such scenarios, I built this simple demo.
Set up PostgreSQL with pgvector
I started a pgvector container:
docker run --name pgv -d -e POSTGRES_PASSWORD=franck pgvector/pgvector:0.8.0-pg17
docker exec -it pgv psql -U postgres
I enable the extension:
create extension if not exists vector;
Importing a synthetic dataset
I create a function to generate a random vector:
create function random_embedding(dimensions int) returns vector as
$$
select
array(
select random()::real
from generate_series(1, dimensions)
)::vector
$$ language sql;
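As a quick check, calling it with a small dimension shows the output format (the values differ on every call since they are random):
select random_embedding(4);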
I create a table to store embeddings ("embedding") with some metadata ("color"):
create table embeddings_table (
id bigserial primary key,
color text,
embedding vector(512)
);
I inserted two million rows, each containing a randomly generated 512-dimensional vector, and assigned one of three colors as metadata:
insert into embeddings_table (embedding,color)
select random_embedding(512)
,(array['red', 'green', 'blue'])[generate_series(1, 2000000)%3+1]
;
I used the random() function to ensure data is evenly distributed. The points are positioned in a 512-dimensional space, with one-third of the rows assigned to each color. This is a synthetic dataset, which makes it easier to understand the result.
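To verify this distribution, a simple count per color should show roughly one-third of the two million rows for each value:
select color, count(*) from embeddings_table group by color order by color;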
I create a vector index, HNSW (Hierarchical Navigable Small Worlds), on the embeddings, using cosine similarity:
create index i1 on embeddings_table
using hnsw ( embedding vector_cosine_ops )
;
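This uses pgvector's default HNSW build parameters. For reference, assuming the pgvector 0.8.0 defaults, the same index could have been created with them spelled out (m is the number of connections per node, ef_construction the size of the candidate list during the build):
create index i1 on embeddings_table
using hnsw ( embedding vector_cosine_ops )
with ( m = 16, ef_construction = 64 )
;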
Query example
I generated one more random vector to use in my queries:
select random_embedding(512)
\gset
postgres=# select :'random_embedding';
?column?
---------------------------------------------------------------------
[0.899858,0.08105531,0.78641415,0.07696906,0.08429382,...,0.5175713,0.8292444]
The cosine similarity search will find the points in the table whose vectors form the smallest angle with the reference vector I've stored in the :'random_embedding' variable.
I want to query the 15 points that are most similar to this reference point, but only consider the green category.
The following query filters on green rows (where color='green'), calculates the cosine similarity (embedding <=> :'random_embedding'), and keeps the nearest 15 points (order by nn_cosine limit 15):
select id , color, embedding <=> :'random_embedding' nn_cosine
from embeddings_table
where color='green'
order by nn_cosine limit 15;
I used "nn_cosine" for the nearest neighbor cosine search. In future queries, I'll use "enn_cosine" or "ann_cosine" depending on whether I expect an exact or approximate result, from a full table scan or an index scan.
Embeddings have too many dimensions for us to visualize easily, but here's an analogy in our three-dimensional world. My dataset is like a ball pit with red, green, and blue balls, where each ball's position represents the meaning of the data. Cosine similarity search is akin to pointing a laser from the center of the pit toward a reference point, which corresponds to the meaning we are looking for, and identifying balls whose positions form the smallest angles with the laser ray. Post-filtering searches all balls, then discards red and blue balls afterward. Pre-filtering considers only green balls when searching around the laser ray.
Exact Nearest Neighbors (ENN) with full scan
First, I disable the index to get an exact result:
postgres=# set enable_indexscan to off;
SET
postgres=# select id , color, embedding <=> :'random_embedding' enn_cosine
from embeddings_table
where color='green'
order by enn_cosine limit 15
;
id | color | enn_cosine
---------+-------+---------------------
1428352 | green | 0.19814620075833056
328933 | green | 0.2024464516951111
1261723 | green | 0.2031157228085848
1836052 | green | 0.20319815669479213
1536328 | green | 0.20353639191885098
1221802 | green | 0.20355073458778694
1328614 | green | 0.20373734017866685
327802 | green | 0.20464172025637872
238738 | green | 0.2048113256211399
969943 | green | 0.20566046923407266
1924495 | green | 0.2059847615560182
486043 | green | 0.20615577737388402
1911601 | green | 0.20652312839386933
1777339 | green | 0.20658742123960594
875029 | green | 0.20664456413189736
(15 rows)
Without an index, this query is slow because it calculates the distance for each row that meets the 'color' predicate, sorts them by this distance, and retrieves the Top-15 results, but it has the advantage of providing an exact result:
postgres=# explain (analyze, buffers, costs off, summary on)
select id , color, embedding <=> :'random_embedding' enn_cosine
from embeddings_table
where color='green'
order by enn_cosine limit 15
;
QUERY PLAN
-------------------------------------------------------------------
Limit (actual time=1868.024..1878.636 rows=15 loops=1)
Buffers: shared hit=1989174 read=692354
-> Gather Merge (actual time=1868.022..1878.632 rows=15 loops=1)
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=1989174 read=692354
-> Sort (actual time=1853.062..1853.063 rows=13 loops=3)
Sort Key: ((embedding <=> '[0.899858,0.08105531,0.78641415,0.07696906,0.08429382,...,0.5175713,0.8292444]'::vector))
Sort Method: top-N heapsort Memory: 26kB
Buffers: shared hit=1989174 read=692354
Worker 0: Sort Method: top-N heapsort Memory: 26kB
Worker 1: Sort Method: top-N heapsort Memory: 26kB
-> Parallel Seq Scan on embeddings_table (actual time=0.126..1797.436 rows=222222 loops=3)
Filter: (color = 'green'::text)
Rows Removed by Filter: 444444
Buffers: shared hit=1989107 read=692347
Planning:
Buffers: shared read=1
Planning Time: 0.124 ms
Execution Time: 1878.658 ms
PostgreSQL used three parallel processes (a leader and two workers). The filter on "color" removed 444444 rows per process, two-thirds of the total, leaving 222222 rows per process, or 666666 rows in all. Each process calculated the distance for its assigned rows and sorted them, and finally the coordinator gathered the top 15 results from the parallel processes.
Approximate Nearest Neighbors (ANN) with index
I enable the index to get a faster, but approximate, result:
postgres=# set enable_indexscan to on;
SET
postgres=# explain (analyze, buffers, costs off, summary on)
select id , color, embedding <=> :'random_embedding' ann_cosine
from embeddings_table
where color='green'
order by ann_cosine limit 15
;
QUERY PLAN
-------------------------------------------------------------------
Limit (actual time=5.605..5.916 rows=11 loops=1)
Buffers: shared hit=84 read=1470
-> Index Scan using i1 on embeddings_table (actual time=5.604..5.912 rows=11 loops=1)
Order By: (embedding <=> '[0.899858,0.08105531,0.78641415,0.07696906,0.08429382,...,0.5175713,0.8292444]'::vector)
Filter: (color = 'green'::text)
Rows Removed by Filter: 29
Buffers: shared hit=84 read=1470
Planning:
Buffers: shared read=1
Planning Time: 0.089 ms
Execution Time: 5.934 ms
The index was used to retrieve rows in cosine similarity order relative to my reference point (Order By: (embedding <=> '[...]'::vector)), but the search was limited to 40 candidates (the default hnsw.ef_search). 29 rows were discarded by the metadata filter ((color = 'green'::text)), leaving 11 rows (rows=11). Because of this, there weren't enough candidates for the expected result (limit 15), and I got fewer rows than expected:
postgres=# select id , color, embedding <=> :'random_embedding' ann_cosine
from embeddings_table
where color='green'
order by ann_cosine limit 15
;
id | color | ann_cosine
---------+-------+---------------------
1875277 | green | 0.2076671534464677
222817 | green | 0.21016644773554916
1375990 | green | 0.2118530398002575
21538 | green | 0.21207386707694031
1355350 | green | 0.2121940467579876
505036 | green | 0.21220934429072225
1570702 | green | 0.21469847813732257
1997836 | green | 0.21482420378988654
1195270 | green | 0.21613844835346685
634417 | green | 0.2172001587963871
1995160 | green | 0.21794015870874028
(11 rows)
I used the default hnsw.iterative_scan set to off. Enabling iterative scans would fetch more results by resuming the scan until there are enough candidates for limit 15.
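For example, in pgvector 0.8.0, either raising the candidate list or enabling iterative scans should let the index scan find 15 green rows. A sketch of the session settings (not measured here):
-- fetch more candidates in a single scan (default is 40)
set hnsw.ef_search = 200;
-- or let pgvector resume the scan until enough rows pass the filter
set hnsw.iterative_scan = relaxed_order;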
Compare ENN and ANN
The nearest green neighbor found by the index has a cosine similarity of 0.2076671534464677, but the exact search identified fifteen closer green points. This discrepancy highlights a characteristic of Hierarchical Navigable Small Worlds (HNSW). The index is constructed with multiple layers, beginning with sparse upper layers that serve as entry points. These layers help restrict the search scope as it descends into the dense bottom layer (layer 0). While this layered design accelerates search times, it can also miss closer neighbors in layer 0 that were not explored when descending from the upper layers.
Since all points in my dataset were generated using random() values, it lacks natural clustering. This diminishes routing efficiency and increases the likelihood of missing closer neighbors during approximate searches. Therefore, I am illustrating the worst-case scenario.
I checked how many results the approximate search missed by disabling the pgvector index. 23 green points are closer than the first row returned by the approximate search, which appears as the 24th row in the following:
postgres=# set enable_indexscan to off;
SET
postgres=# select id , color, embedding <=> :'random_embedding' enn_cosine
from embeddings_table
where color='green'
and embedding <=> :'random_embedding' <= 0.2076671534464677
order by enn_cosine
;
id | color | enn_cosine
---------+-------+---------------------
1428352 | green | 0.19814620075833056
328933 | green | 0.2024464516951111
1261723 | green | 0.2031157228085848
1836052 | green | 0.20319815669479213
1536328 | green | 0.20353639191885098
1221802 | green | 0.20355073458778694
1328614 | green | 0.20373734017866685
327802 | green | 0.20464172025637872
238738 | green | 0.2048113256211399
969943 | green | 0.20566046923407266
1924495 | green | 0.2059847615560182
486043 | green | 0.20615577737388402
1911601 | green | 0.20652312839386933
1777339 | green | 0.20658742123960594
875029 | green | 0.20664456413189736
593119 | green | 0.2066683273490607
1354432 | green | 0.20685417261064953
1898782 | green | 0.20697419915308368
1429552 | green | 0.20704169544999784
1293397 | green | 0.20746811422822542
1371502 | green | 0.20746937923342468
998884 | green | 0.2074836628885286
845659 | green | 0.20759016691317878
1875277 | green | 0.2076671534464677
(24 rows)
The measure for this approximation accuracy is called 'recall'. The definition from the MongoDB glossary is:
Recall measures the fraction of true nearest neighbors that were returned by an ANN search. This measure reflects how close the algorithm approximates the results of ENN search.
This approximation applies to all index searches, which are faster than full scans but may miss some closer neighbors. Post-filtering reduces recall even further, as some candidates are discarded, leading to the possibility of missing good matches. That's why it is better to use pre-filtering on large databases. Although pre-filtering is not available in PostgreSQL pgvector, we can analyze the data to understand its potential impact.
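To put a number on it, one way to measure recall for this query is to compare the approximate Top-15 with the exact Top-15 for the same reference vector. Here is a minimal sketch, run in the same psql session (the temporary table and CTE names are just illustrative):
set enable_indexscan to off;
create temporary table enn_top15 as
select id from embeddings_table
where color='green'
order by embedding <=> :'random_embedding' limit 15;
set enable_indexscan to on;
with ann_top15 as (
select id from embeddings_table
where color='green'
order by embedding <=> :'random_embedding' limit 15
)
select count(*)::float / 15 as recall
from ann_top15 join enn_top15 using (id);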
Post-filtering impact on recall
The problem is that pgvector cannot apply the filter inside the index scan. As a result, the scan returns 40 candidates by default, as defined by the hnsw.ef_search parameter, and filtering on other columns, like "color", afterwards reduces the result.
To explain this, I've run the query without the filter, showing the first 40 candidates across the three colors:
postgres=# set enable_indexscan to on;
SET
postgres=# select id , color, embedding <=> :'random_embedding' ann_cosine
from embeddings_table
-- where color='green'
order by ann_cosine limit 40 -- current_setting('hnsw.ef_search')
;
id | color | ann_cosine
---------+-------+---------------------
1360308 | red | 0.20296796169334463
1732580 | blue | 0.20459117837053364
1085082 | red | 0.20626595570441808
1875277 | green | 0.2076671534464677
656874 | red | 0.20930762441475093
504242 | blue | 0.20984078446453025
247001 | blue | 0.20995935279258404
222817 | green | 0.21016644773554916
705833 | blue | 0.2102792157006329
1966310 | blue | 0.21028852384517327
1503321 | red | 0.21044019511424406
480572 | blue | 0.21100294080666748
1375990 | green | 0.2118530398002575
21538 | green | 0.21207386707694031
1355350 | green | 0.2121940467579876
505036 | green | 0.21220934429072225
106983 | red | 0.21293893184210688
238458 | red | 0.21295064471740388
1008380 | blue | 0.21301481665902566
298931 | blue | 0.21304336639331967
1147263 | red | 0.21342607115241874
562604 | blue | 0.2135493812411281
1570702 | green | 0.21469847813732257
1997836 | green | 0.21482420378988654
1291274 | blue | 0.2159207514557735
1195270 | green | 0.21613844835346685
1035932 | blue | 0.21623180106532514
1010237 | blue | 0.2164365008134519
1256966 | blue | 0.21652825716033564
1748231 | blue | 0.21676377376711842
634417 | green | 0.2172001587963871
1685391 | red | 0.21723542532805584
1964021 | blue | 0.21723845625858207
1056446 | blue | 0.21757530726298147
958670 | blue | 0.21769898462687431
1558046 | blue | 0.2177076235462454
516734 | blue | 0.21777311307937175
1995160 | green | 0.21794015870874028
228096 | red | 0.21866579506700412
660161 | blue | 0.2187276449697918
(40 rows)
This is what the query using the index did as a first step. If you keep only the green rows, you get 11 rows. If the filter had been applied during the index scan, it would have returned 40 green candidates and the query could have extracted the Top-15 from them. Post-filtering misses some good green candidates because non-matching rows took their place among the 40.
It would have been better to pre-filter during the approximate nearest neighbor search to consider only 'green' neighbors, rather than discarding non-matching ones afterward based on a post-filter on 'color'. However, pgvector does not support such pre-filtering. The consequence is a low recall in a single index scan.
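One possible workaround, when the filter values are few and known in advance, is a partial HNSW index per value, so the index scan only ever visits matching rows. A sketch for the 'green' case (the index name is just illustrative, and each distinct value needs its own index):
create index i1_green on embeddings_table
using hnsw ( embedding vector_cosine_ops )
where color='green'
;
The planner can then pick this index for queries that filter on color='green', which behaves like pre-filtering for that one value.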
Without filter - ENN and ANN
For queries without a filter, recall is generally better because, although approximate searches may miss some points, all candidates returned by the index are included in the results. Here are the Top 15 most similar points across all colors:
postgres=# set enable_indexscan to off;
SET
postgres=# select id , color, embedding <=> :'random_embedding' enn_cosine
from embeddings_table
--where color='green'
order by enn_cosine limit 15;
id | color | enn_cosine
---------+-------+---------------------
1506704 | blue | 0.1943345774574703
1428352 | green | 0.19814620075833056
905583 | red | 0.1986930398354949
1887345 | red | 0.19958922153843262
1408551 | red | 0.20155542317891084
1761962 | blue | 0.20168765608150285
91001 | blue | 0.20206633541960917
328933 | green | 0.2024464516951111
493388 | blue | 0.20277316748365937
1360308 | red | 0.20296796169334463
1261723 | green | 0.2031157228085848
1836052 | green | 0.20319815669479213
816615 | red | 0.20350817237259144
1536328 | green | 0.20353639191885098
1221802 | green | 0.20355073458778694
(15 rows)
Here is the same using an index:
postgres=# set enable_indexscan to on;
SET
postgres=# select id , color, embedding <=> :'random_embedding' ann_cosine
from embeddings_table
--where color='green'
order by ann_cosine limit 15;
id | color | ann_cosine
---------+-------+---------------------
1360308 | red | 0.20296796169334463
1732580 | blue | 0.20459117837053364
1085082 | red | 0.20626595570441808
1875277 | green | 0.2076671534464677
by Franck Pachot
Open Wins Again II: From Fork to Force
June 23, 2025
Migrate a self-managed MySQL database to Amazon Aurora MySQL using AWS DMS homogeneous data migrations
June 22, 2025
faster $lookup after $group in MongoDB aggregation pipeline
In the previous post, One million $lookup challenge, I said that $lookup is efficient at the end of an aggregation pipeline, not before the aggregation. I was referencing another post, Comparison of JOINS 👉🏻 aggregation pipeline and CTEs, where I simply inverted the join ($lookup) and the aggregation ($group) to fix a bad query. What if you have to look up from millions of documents without an aggregation, like in the One million $lookup challenge?
I created a dimension table with one thousand documents, and a fact table with one million. The fact table has a "ref" field that references the "_id" in the dimension table:
db.dim.drop();
db.fact.drop();
db.dim.insertMany(
Array.from({ length: 1000 }, (_, i) => ({
_id: i + 1,
value: Math.random()
}))
);
db.fact.insertMany(
Array.from({ length: 1000000 }, () => ({
ref: Math.ceil(Math.random() * 1000),
value: Math.random()
}))
);
A many-to-one relationship should embed values directly, eliminating the need for lookups within a document model. However, I diverged from this recommendation to build a demo illustrating that lookups are acceptable when dealing with a limited number of documents, such as after aggregation.
Lookup (IndexedLoopJoin): 10 seconds
Here is an aggregation pipeline with a lookup.
x=db.fact.aggregate([
{
$lookup: {
from: "dim",
localField: "ref",
foreignField: "_id",
as: "dim" ,
}
},
]).explain("executionStats")
;
print(x["executionStats"]["executionTimeMillis"]/1000+" seconds")
On this data, the operation completes in ten seconds. While the Index Nested Loop Join on one million documents with a scalar "ref" takes some time, performance can be improved by avoiding multiple lookups for the same value.
$group, $lookup, $unwind: 3 seconds
I group by the "ref" values prior to the join to minimize lookup operations, ensuring each value is processed only once:
x=db.fact.aggregate([
{ // one group per "ref" with an array of "facts"
$group: {
_id: "$ref",
facts: { $push: "$$ROOT" },
}
},
{ // join from the group
$lookup: {
from: "dim",
localField: "_id",
foreignField: "_id",
as: "dim"
}
},
{ // un-group to get back one document per fact
$unwind: "$facts"
},
{ // project to original fields
$project: {
_id: "$facts._id",
ref: "$facts.ref",
value: "$facts.value",
dim: "$dim",
}
}
]).explain("executionStats");
print(x["stages"][0]["$cursor"]
["executionStats"]["executionTimeMillis"]/1000+" seconds")
This takes three seconds to run. When only one nested loop per thousand documents is used, compared to one million, the lookup stage becomes faster. Although grouping and unwinding add some time, if grouping is already present, as is typical in an aggregation pipeline, performing the lookup after grouping is simply more efficient than doing it beforehand.
June 21, 2025
Debugging memory leaks in Postgres, jemalloc edition
This is an external post of mine. Click here if you are not redirected.