June 30, 2025
We Made Postgres Writes Faster, but it Broke Replication
June 28, 2025
Flush to disk on commit: MongoDB durable writes
A Twitter (X) thread was filled with misconceptions about MongoDB, spreading fear, uncertainty, and doubt (FUD). This led one user to question whether MongoDB acknowledges writes before they are actually flushed to disk:
Doesn't MongoDB acknowledge writes before it's actually flushed them to disk?
MongoDB, like many databases, employs journaling, also known as write-ahead logging (WAL), to ensure durability (the D in ACID) with high performance. This involves safely recording write operations in the journal and ensuring they are flushed to disk before the commit is acknowledged. Further details can be found in the documentation under Write Concern and Journaling.
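For example, an application can explicitly request journal durability per write through the write concern's j option (a minimal mongosh sketch; the collection and document are illustrative):
db.mycollection.insertOne(
  { _id: 42, status: "ok" },
  { writeConcern: { w: "majority", j: true } }  // acknowledge only once the write is in the on-disk journal
);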
Here is how you can test it, in a lab, with Linux strace and GDB, to debunk the myths.
Start the lab
I created a local MongoDB server. I use a single-node local Atlas deployment here, but you can do the same with a replica set:
atlas deployments setup atlas --type local --port 27017 --force
Start it if it was stopped, and connect with MongoDB Shell:
atlas deployments start atlas
mongosh
Trace the system calls with strace
In another terminal, I used strace to display the system calls (-e trace=) that write (pwrite64) and sync (fdatasync) the files, showing the file names (-yy), for the MongoDB server process (-p $(pgrep -d, mongod)) and its threads (-f), with timestamps and execution times (-tT):
strace -tT -fp $(pgrep -d, mongod) -yye trace=pwrite64,fdatasync -qqs 0
Some writes and syncs happen in the background:
[pid 2625869] 08:26:13 fdatasync(11</data/db/WiredTiger.wt>) = 0 <0.000022>
[pid 2625869] 08:26:13 pwrite64(13</data/db/journal/WiredTigerLog.0000000010>, ""..., 384, 19072) = 384 <0.000024>
[pid 2625869] 08:26:13 fdatasync(13</data/db/journal/WiredTigerLog.0000000010>) = 0 <0.002123>
[pid 2625868] 08:26:13 pwrite64(13</data/db/journal/WiredTigerLog.0000000010>, ""..., 128, 19456) = 128 <0.000057>
[pid 2625868] 08:26:13 fdatasync(13</data/db/journal/WiredTigerLog.0000000010>) = 0 <0.002192>
[pid 2625868] 08:26:23 pwrite64(13</data/db/journal/WiredTigerLog.0000000010>, ""..., 384, 19584) = 384 <0.000057>
[pid 2625868] 08:26:23 fdatasync(13</data/db/journal/WiredTigerLog.0000000010>) = 0 <0.002068>
[pid 2625868] 08:26:33 pwrite64(13</data/db/journal/WiredTigerLog.0000000010>, ""..., 384, 19968) = 384 <0.000061>
[pid 2625868] 08:26:33 fdatasync(13</data/db/journal/WiredTigerLog.0000000010>) = 0 <0.002747>
[pid 2625868] 08:26:43 pwrite64(13</data/db/journal/WiredTigerLog.0000000010>, ""..., 384, 20352) = 384 <0.000065>
[pid 2625868] 08:26:43 fdatasync(13</data/db/journal/WiredTigerLog.0000000010>) = 0 <0.003008>
[pid 2625868] 08:26:53 pwrite64(13</data/db/journal/WiredTigerLog.0000000010>, ""..., 384, 20736) = 384 <0.000075>
[pid 2625868] 08:26:53 fdatasync(13</data/db/journal/WiredTigerLog.0000000010>) = 0 <0.002092>
[pid 2625868] 08:27:03 pwrite64(13</data/db/journal/WiredTigerLog.0000000010>, ""..., 384, 21120) = 384 <0.000061>
[pid 2625868] 08:27:03 fdatasync(13</data/db/journal/WiredTigerLog.0000000010>) = 0 <0.002527>
[pid 2625869] 08:27:13 fdatasync(13</data/db/journal/WiredTigerLog.0000000010>) = 0 <0.000033>
Write to the collection
In the MongoDB shell, I created a collection and ran ten updates:
db.mycollection.drop();
db.mycollection.insertOne( { _id: 1, num: 0 });
for (let i = 1; i <= 10; i++) {
db.mycollection.updateOne( { _id: 1 }, { $inc: { num: 1 } });
print(` ${i} ${new Date()}`)
}
Here is the strace output while running the loop of ten updates:
[pid 2625868] 08:33:07 pwrite64(13</data/db/journal/WiredTigerLog.0000000010>, ""..., 512, 76288) = 512 <0.000066>
[pid 2625868] 08:33:07 fdatasync(13</data/db/journal/WiredTigerLog.0000000010>) = 0 <0.001865>
[pid 2625868] 08:33:07 pwrite64(13</data/db/journal/WiredTigerLog.0000000010>, ""..., 512, 76800) = 512 <0.000072>
[pid 2625868] 08:33:07 fdatasync(13</data/db/journal/WiredTigerLog.0000000010>) = 0 <0.001812>
[pid 2625868] 08:33:07 pwrite64(13</data/db/journal/WiredTigerLog.0000000010>, ""..., 512, 77312) = 512 <0.000056>
[pid 2625868] 08:33:07 fdatasync(13</data/db/journal/WiredTigerLog.0000000010>) = 0 <0.001641>
[pid 2625868] 08:33:07 pwrite64(13</data/db/journal/WiredTigerLog.0000000010>, ""..., 512, 77824) = 512 <0.000043>
[pid 2625868] 08:33:07 fdatasync(13</data/db/journal/WiredTigerLog.0000000010>) = 0 <0.001812>
[pid 2625868] 08:33:07 pwrite64(13</data/db/journal/WiredTigerLog.0000000010>, ""..., 512, 78336) = 512 <0.000175>
[pid 2625868] 08:33:07 fdatasync(13</data/db/journal/WiredTigerLog.0000000010>) = 0 <0.001944>
[pid 2625868] 08:33:07 pwrite64(13</data/db/journal/WiredTigerLog.0000000010>, ""..., 512, 78848) = 512 <0.000043>
[pid 2625868] 08:33:07 fdatasync(13</data/db/journal/WiredTigerLog.0000000010>) = 0 <0.001829>
[pid 2625868] 08:33:07 pwrite64(13</data/db/journal/WiredTigerLog.0000000010>, ""..., 512, 79360) = 512 <0.000043>
[pid 2625868] 08:33:07 fdatasync(13</data/db/journal/WiredTigerLog.0000000010>) = 0 <0.001917>
[pid 2625868] 08:33:07 pwrite64(13</data/db/journal/WiredTigerLog.0000000010>, ""..., 512, 79872) = 512 <0.000050>
[pid 2625868] 08:33:07 fdatasync(13</data/db/journal/WiredTigerLog.0000000010>) = 0 <0.002260>
[pid 2625868] 08:33:07 pwrite64(13</data/db/journal/WiredTigerLog.0000000010>, ""..., 512, 80384) = 512 <0.000035>
[pid 2625868] 08:33:07 fdatasync(13</data/db/journal/WiredTigerLog.0000000010>) = 0 <0.001940>
[pid 2625868] 08:33:07 pwrite64(13</data/db/journal/WiredTigerLog.0000000010>, ""..., 512, 80896) = 512 <0.000054>
[pid 2625868] 08:33:07 fdatasync(13</data/db/journal/WiredTigerLog.0000000010>) = 0 <0.001984>
Each write (pwrite64) to the journal files was followed by a sync to disk (fdatasync). This system call is well documented:
FSYNC(2) Linux Programmer's Manual FSYNC(2)
NAME
fsync, fdatasync - synchronize a file's in-core state with storage device
DESCRIPTION
fsync() transfers ("flushes") all modified in-core data of (i.e., modified buffer cache pages for) the file referred to by the file descriptor fd to
the disk device (or other permanent storage device) so that all changed information can be retrieved even if the system crashes or is rebooted.
This includes writing through or flushing a disk cache if present. The call blocks until the device reports that the transfer has completed.
...
fdatasync() is similar to fsync(), but does not flush modified metadata unless that metadata is needed in order to allow a subsequent data retrieval to be correctly handled. For example, changes to st_atime or st_mtime (respectively, time of last access and time of last modification)
...
The aim of fdatasync() is to reduce disk activity for applications that do not require all metadata to be synchronized with the disk.
Since I display both the committed time and the system call trace times, you can see that they match. The output related to the traces above demonstrates this alignment:
1 Sat Jun 28 2025 08:33:07 GMT+0000 (Greenwich Mean Time)
2 Sat Jun 28 2025 08:33:07 GMT+0000 (Greenwich Mean Time)
3 Sat Jun 28 2025 08:33:07 GMT+0000 (Greenwich Mean Time)
4 Sat Jun 28 2025 08:33:07 GMT+0000 (Greenwich Mean Time)
5 Sat Jun 28 2025 08:33:07 GMT+0000 (Greenwich Mean Time)
6 Sat Jun 28 2025 08:33:07 GMT+0000 (Greenwich Mean Time)
7 Sat Jun 28 2025 08:33:07 GMT+0000 (Greenwich Mean Time)
8 Sat Jun 28 2025 08:33:07 GMT+0000 (Greenwich Mean Time)
9 Sat Jun 28 2025 08:33:07 GMT+0000 (Greenwich Mean Time)
10 Sat Jun 28 2025 08:33:07 GMT+0000 (Greenwich Mean Time)
Multi-document transactions
The previous example ran ten autocommit updates, each triggering a synchronization to disk.
In general, with good document data modeling, a document should match the business transaction. However, it is possible to use multi-document transactions, and they are ACID (atomic, consistent, isolated, and durable). Using multi-document transactions also reduces sync latency, as the journal sync is required only once per transaction, at commit.
I've run the following with five transactions, each running one update and one insert:
const session = db.getMongo().startSession();
for (let i = 1; i <= 5; i++) {
session.startTransaction();
const sessionDb = session.getDatabase(db.getName());
sessionDb.mycollection.updateOne( { _id: 1 }, { $inc: { num: 1 } });
print(` ${i} updated ${new Date()}`)
sessionDb.mycollection.insertOne( { answer:42 });
print(` ${i} inserted ${new Date()}`)
session.commitTransaction();
print(` ${i} committed ${new Date()}`)
}
Strace still shows ten calls to pwrite64 and fdatasync. I used these multi-document transactions to go further and prove that the commit not only triggers a sync to disk, but also waits for its acknowledgement before returning success to the application.
Inject some latency with gdb
To show that the commit waits for the acknowledgment of fdatasync, I used a GDB breakpoint on the fdatasync call.
I stopped strace, and started GDB with a script that adds a latency of five seconds to fdatasync:
cat > gdb_slow_fdatasync.gdb <<GDB
break fdatasync
commands
shell sleep 5
continue
end
continue
GDB
gdb --batch -x gdb_slow_fdatasync.gdb -p $(pgrep mongod)
I ran the five transactions, each with two writes. GDB shows when it hits the breakpoint:
Thread 31 "JournalFlusher" hit Breakpoint 1, 0x0000ffffa6096eec in fdatasync () from target:/lib64/libc.so.6
My GDB script automatically waits five seconds and continues the program, until the next call to fdatasync.
Here was the output from my loop with five transactions:
1 updated Sat Jun 28 2025 08:49:32 GMT+0000 (Greenwich Mean Time)
1 inserted Sat Jun 28 2025 08:49:32 GMT+0000 (Greenwich Mean Time)
1 committed Sat Jun 28 2025 08:49:37 GMT+0000 (Greenwich Mean Time)
2 updated Sat Jun 28 2025 08:49:37 GMT+0000 (Greenwich Mean Time)
2 inserted Sat Jun 28 2025 08:49:37 GMT+0000 (Greenwich Mean Time)
2 committed Sat Jun 28 2025 08:49:42 GMT+0000 (Greenwich Mean Time)
3 updated Sat Jun 28 2025 08:49:42 GMT+0000 (Greenwich Mean Time)
3 inserted Sat Jun 28 2025 08:49:42 GMT+0000 (Greenwich Mean Time)
3 committed Sat Jun 28 2025 08:49:47 GMT+0000 (Greenwich Mean Time)
4 updated Sat Jun 28 2025 08:49:47 GMT+0000 (Greenwich Mean Time)
4 inserted Sat Jun 28 2025 08:49:47 GMT+0000 (Greenwich Mean Time)
4 committed Sat Jun 28 2025 08:49:52 GMT+0000 (Greenwich Mean Time)
5 updated Sat Jun 28 2025 08:49:52 GMT+0000 (Greenwich Mean Time)
5 inserted Sat Jun 28 2025 08:49:52 GMT+0000 (Greenwich Mean Time)
The insert and update operations occur immediately, but the commit itself waits five seconds, because of the latency I injected with GDB. This demonstrates that the commit waits for fdatasync, guaranteeing the flush to persistent storage. For this demo, I used all default settings in MongoDB 8.0, but this behavior can still be tuned through write concern and journaling configurations.
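An application that can tolerate losing its most recent writes on a crash can relax this per operation (a minimal mongosh sketch, not part of the test above; { w: 1, j: false } skips the wait for the journal flush in exchange for lower latency):
db.mycollection.updateOne(
  { _id: 1 },
  { $inc: { num: 1 } },
  { writeConcern: { w: 1, j: false } }  // acknowledged after the in-memory apply, without waiting for fdatasync
);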
I used GDB to examine the call stack. Alternatively, you can inject a delay with strace by adding this option: -e inject=fdatasync:delay_enter=5000000.
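For example, here is a sketch of the same delay injection using only strace, reusing the flags from earlier (the delay is in microseconds, so 5000000 is five seconds):
strace -tT -fp $(pgrep -d, mongod) -yye trace=pwrite64,fdatasync -e inject=fdatasync:delay_enter=5000000 -qqs 0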
Look at the open source code
When calling fdatasync, errors can occur, and this may compromise durability if operations on the file descriptor continue (remember the PostgreSQL fsyncgate). MongoDB uses the open-source WiredTiger storage engine, which implemented the same solution as PostgreSQL to avoid that: panic instead of retry. You can review the os_fs.c code to verify this.
The fdatasync call is in the JournalFlusher thread and here is the backtrace:
#0 0x0000ffffa0b5ceec in fdatasync () from target:/lib64/libc.so.6
#1 0x0000aaaadf5312c0 in __posix_file_sync ()
#2 0x0000aaaadf4f53c8 in __log_fsync_file ()
#3 0x0000aaaadf4f58d4 in __wt_log_force_sync ()
#4 0x0000aaaadf4fb8b8 in __wt_log_flush ()
#5 0x0000aaaadf588348 in __session_log_flush ()
#6 0x0000aaaadf41b878 in mongo::WiredTigerSessionCache::waitUntilDurable(mongo::OperationContext*, mongo::WiredTigerSessionCache::Fsync, mongo::WiredTigerSessionCache::UseJournalListener) ()
#7 0x0000aaaadf412358 in mongo::WiredTigerRecoveryUnit::waitUntilDurable(mongo::OperationContext*) ()
#8 0x0000aaaadfbe855c in mongo::JournalFlusher::run() ()
Here are some entrypoints if you want to look at the code behind this:
- JournalFlusher, which calls the posix system call in WiredTiger's os_fs.c: journal_flusher.cpp
- waitForWriteConcern, calling the JournalFlusher from the connection's thread: write_concern.cpp
Base your opinions on facts, not myths.
MongoDB began as a NoSQL database that prioritized availability and low latency over strong consistency. However, that was over ten years ago. As technology evolves, experts who refuse to constantly learn risk their knowledge becoming outdated, their skills diminishing, and their credibility suffering.
Today, MongoDB is a general-purpose database that supports transaction atomicity, consistency, isolation, and durability, whether the transaction involves a single document or multiple documents.
Next time you encounter claims from the uninformed or from detractors suggesting that MongoDB is not consistent or fails to flush committed changes to disk, you can confidently debunk these myths by referring to the official documentation, reading the open source code, and conducting your own experiments. MongoDB is similar to PostgreSQL: buffered writes and WAL sync to disk on commit.
Want to meet people, try charging them for it?
I have been blogging consistently since 2017. And one of my goals in speaking publicly was always to connect with like-minded people. I always left my email and hoped people would get in touch. Even while my blog and Twitter became popular, passing 1M views and 20k followers, I basically never had people get in touch to chat or meet up.
So it felt kind of ridiculous when last November I started charging people $100 to chat. I mean, who am I? But people started showing up fairly immediately. Now granted the money did not go to me. It went to an education non-profit and I merely received the receipt.
And at this point I've met a number of interesting people, from VCs to business professors to undergraduate students to founders and everyone in between. People wanting to talk about trends in databases, about how to succeed as a programmer, about marketing for developers, and so on. Women and men throughout North America, Europe, Africa, New Zealand, India, Nepal, and so on. And I've raised nearly $6000 for educational non-profits.
How is it that you go from giving away your time for free and getting no hits to charging and almost immediately getting results? For one, every person responded very positively to it being a fundraiser. It also helps me be entirely shameless about sharing on social media every single time someone donates, because it's such a positive thing.
But also I think that in "charging" for my time it helps people feel more comfortable about actually taking my time, especially when we have never met. It gives you a reasonable excuse to take time from an internet rando.
On the other hand, a lot of people come for advice and I think giving advice is pretty dangerous, especially since my background is not super conventional. I try to always frame things as just sharing my opinion and my perspective and that they should talk with many others and not take my suggestions without consideration.
And there's also the problem that by charging everyone for my time now, I'm no longer available to people who could maybe use it the most. I do mention on my page that I will still take calls from people who don't donate, as my schedule allows. But to be honest I feel less incentivized to spend time when people do not donate. So I guess this is an issue with the program.
But I mitigated even this slightly, and significantly jump-started the program, during my 30th birthday when I took calls with any person who donated at least $30.
Anyway, I picked this path because I have wanted to get involved with helping students figure out their lives and careers. But without a degree I am literally unqualified for many volunteering programs. And I always found the time commitments for non-profits painful.
So until starting this I figured it wouldn't be until I retire that I find some way to make a difference. But ultimately I kept meeting people who were starting their own non-profits now or donated significantly to help students. Peer pressure. I wanted to do my part now. And 30 minutes of my time in return for a donation receipt has been an easy trade.
While only raising a humble $6,000 to date, the Chat for Education program has been more successful than I imagined. I've met many amazing people through it. And it's something that should be easy to keep up indefinitely.
I hope to meet you through it too!
June 27, 2025
Supercharging AWS database development with AWS MCP servers
Managing PostgreSQL on Kubernetes with Percona Everest's REST API
Why we ditched Prometheus for autoscaling (and don't miss it)
Scaling up ClickHouse ingestion with a Multi-writer architecture
Cutting ingestion CPU usage by 25% with smarter JSON to ClickHouse RowBinary conversion
How we made our ingestion pipeline 30% faster with C++ (not Rust, sorry)
How we automatically handle ClickHouse schema migrations
June 26, 2025
Percona XtraDB Cluster: Our Commitment to Open Source High Availability
Scaling Smarter: What You Have Missed in MongoDB 8.0
June 25, 2025
Building a job search engine with PostgreSQL's advanced search features
Using Percona Everest Operator CRDs to Manage Databases in Kubernetes
Build a Personalized AI Assistant with Postgres
Use CedarDB to search the CedarDB docs and blogs
Motivation
Not so long ago, I shared that I have an interest in finding things, and in that case the question was about where something could be found. Another common requirement is, given some expression of interest, finding the set of documents that best answers the question. For example, coupled with the geospatial question, we might specify that we're looking for Indian restaurants within the specified geographic area.
For this article, though, we’ll restrict the focus to the problem of finding the most relevant documents within some collection, where that collection just happens to be the CedarDB documentation. To that end, I’ll assert up front that my query “Does the CedarDB ‘asof join’ use an index?” should return a helpful response, while the query “Does pickled watermelon belong on a taco?” should ideally return an empty result.
June 24, 2025
No pre-filtering in pgvector means reduced ANN recall
AI applications are expanding rapidly, and PostgreSQL is a popular choice among relational databases. The pgvector extension, a third-party add-on, enhances PostgreSQL by introducing a high-dimensional vector data type with similarity operations and search indexing.
Integrating embeddings directly into a general-purpose database eliminates the need for a separate vector database. Typically, approximate searches on embeddings are performed alongside exact searches on other attributes, such as SQL columns or document fields holding metadata, dates, or other dimensions.
PostgreSQL offers various index types, but it has notable limitations when combining them, as we have seen in PostgreSQL JSONB Indexing Limitations with B-Tree and GIN. Likewise, pgvector encounters similar issues.
Some users have moved to MongoDB Atlas Vector Search because it offers pre-filtering capabilities. They had incomplete results with PostgreSQL pgvector when filtering with other predicates. To better understand the impact of lacking pre-filtering in such scenarios, I built this simple demo.
Setup PostgreSQL with pgvector
I started a pgvector container:
docker run --name pgv -d -e POSTGRES_PASSWORD=franck pgvector/pgvector:0.8.0-pg17
docker exec -it pgv psql -U postgres
I enabled the extension:
create extension if not exists vector;
Importing a synthetic dataset
I created a function to generate a random vector:
create function random_embedding(dimensions int) returns vector as
$$
select
array(
select random()::real
from generate_series(1, dimensions)
)::vector
$$ language sql;
I created a table to store embeddings ("embedding") with some metadata ("color"):
create table embeddings_table (
id bigserial primary key,
color text,
embedding vector(512)
);
I inserted two million rows, each containing a randomly generated 512-dimensional vector, and assigned one of three colors as metadata:
insert into embeddings_table (embedding,color)
select random_embedding(512)
,(array['red', 'green', 'blue'])[generate_series(1, 2000000)%3+1]
;
I used the random() function to ensure data is evenly distributed. The points are positioned in a 512-dimensional space, with one-third of the rows assigned to each color. This is a synthetic dataset, which makes it easier to understand the result.
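To check that distribution, a quick sanity query (a sketch; it should show roughly one-third of the rows for each color):
select color, count(*) from embeddings_table group by color order by color;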
I created a vector index, HNSW (Hierarchical Navigable Small Worlds), on the embeddings, using cosine similarity:
create index i1 on embeddings_table
using hnsw ( embedding vector_cosine_ops )
;
Query example
I generated one more random vector to use in my queries:
select random_embedding(512)
\gset
postgres=# select :'random_embedding';
?column?
---------------------------------------------------------------------
[0.899858,0.08105531,0.78641415,0.07696906,0.08429382,...,0.5175713,0.8292444]
The cosine similarity search finds the points in the table whose vectors form the smallest angle with the reference point that I've stored in the :'random_embedding' variable.
I want to query the 15 points that are most similar to this reference point, but only consider the green category.
The following query filters on green rows (where color='green'), calculates the cosine similarity (embedding <=> :'random_embedding'), and keeps the nearest 15 points (order by nn_cosine limit 15):
select id , color, embedding <=> :'random_embedding' nn_cosine
from embeddings_table
where color='green'
order by nn_cosine limit 15;
I used "nn_cosine" for the nearest neighbor cosine search. In future queries, I'll use "enn_cosine" or "ann_cosine" depending on whether I expect an exact or approximate result, from a full table scan or an index scan.
Embeddings have too many dimensions for us to visualize easily, but here's an analogy in our three-dimensional world. My dataset is like a soft ball pool with red, green, and blue balls, where each ball's position represents the meaning of the data. Cosine similarity search is akin to pointing a laser from the center of the pool to a reference point, which corresponds to the meaning we are looking for, and identifying balls whose positions form the smallest angles with the laser ray. Post-filtering searches all balls, then discards red and blue balls afterward. Pre-filtering considers only green balls when searching around the laser ray.
Exact Nearest Neighbors (ENN) with full scan
First, I disable the index to get an exact result:
postgres=# set enable_indexscan to off;
SET
postgres=# select id , color, embedding <=> :'random_embedding' enn_cosine
from embeddings_table
where color='green'
order by enn_cosine limit 15
;
id | color | enn_cosine
---------+-------+---------------------
1428352 | green | 0.19814620075833056
328933 | green | 0.2024464516951111
1261723 | green | 0.2031157228085848
1836052 | green | 0.20319815669479213
1536328 | green | 0.20353639191885098
1221802 | green | 0.20355073458778694
1328614 | green | 0.20373734017866685
327802 | green | 0.20464172025637872
238738 | green | 0.2048113256211399
969943 | green | 0.20566046923407266
1924495 | green | 0.2059847615560182
486043 | green | 0.20615577737388402
1911601 | green | 0.20652312839386933
1777339 | green | 0.20658742123960594
875029 | green | 0.20664456413189736
(15 rows)
Without an index, this query is slow because it calculates the distance for each row that meets the 'color' predicate, sorts them by this distance, and retrieves the Top-15 results, but it has the advantage of providing an exact result:
postgres=# explain (analyze, buffers, costs off, summary on)
select id , color, embedding <=> :'random_embedding' enn_cosine
from embeddings_table
where color='green'
order by enn_cosine limit 15
;
QUERY PLAN
-------------------------------------------------------------------
Limit (actual time=1868.024..1878.636 rows=15 loops=1)
Buffers: shared hit=1989174 read=692354
-> Gather Merge (actual time=1868.022..1878.632 rows=15 loops=1)
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=1989174 read=692354
-> Sort (actual time=1853.062..1853.063 rows=13 loops=3)
Sort Key: ((embedding <=> '[0.899858,0.08105531,0.78641415,0.07696906,0.08429382,...,0.5175713,0.8292444]'::vector))
Sort Method: top-N heapsort Memory: 26kB
Buffers: shared hit=1989174 read=692354
Worker 0: Sort Method: top-N heapsort Memory: 26kB
Worker 1: Sort Method: top-N heapsort Memory: 26kB
-> Parallel Seq Scan on embeddings_table (actual time=0.126..1797.436 rows=222222 loops=3)
Filter: (color = 'green'::text)
Rows Removed by Filter: 444444
Buffers: shared hit=1989107 read=692347
Planning:
Buffers: shared read=1
Planning Time: 0.124 ms
Execution Time: 1878.658 ms
PostgreSQL used three parallel processes (the leader and two workers). The filter on "color" kept 222222 rows per process, 666666 rows in total, and removed 444444 rows per process, which accounts for two-thirds of the table. Each process calculated distances for its assigned rows and sorted them accordingly. Finally, the leader gathered the Top-15 results from the worker processes.
Approximate Nearest Neighbors (ANN) with index
I enable the index to get a faster, but approximate, result:
postgres=# set enable_indexscan to on;
SET
postgres=# explain (analyze, buffers, costs off, summary on)
select id , color, embedding <=> :'random_embedding' ann_cosine
from embeddings_table
where color='green'
order by ann_cosine limit 15
;
QUERY PLAN
-------------------------------------------------------------------
Limit (actual time=5.605..5.916 rows=11 loops=1)
Buffers: shared hit=84 read=1470
-> Index Scan using i1 on embeddings_table (actual time=5.604..5.912 rows=11 loops=1)
Order By: (embedding <=> '[0.899858,0.08105531,0.78641415,0.07696906,0.08429382,...,0.5175713,0.8292444]'::vector)
Filter: (color = 'green'::text)
Rows Removed by Filter: 29
Buffers: shared hit=84 read=1470
Planning:
Buffers: shared read=1
Planning Time: 0.089 ms
Execution Time: 5.934 ms
The index was used to retrieve rows in cosine similarity order relative to my reference point (Order By: (embedding <=> '[...]'::vector)), but the search was limited to 40 candidates (the default hnsw.ef_search). 29 rows were discarded by the metadata filter ((color = 'green'::text)), leaving 11 rows (rows=11). Because of this, there were not enough candidates for the expected result (limit 15), and the query returned fewer rows than expected:
postgres=# select id , color, embedding <=> :'random_embedding' ann_cosine
from embeddings_table
where color='green'
order by ann_cosine limit 15
;
id | color | ann_cosine
---------+-------+---------------------
1875277 | green | 0.2076671534464677
222817 | green | 0.21016644773554916
1375990 | green | 0.2118530398002575
21538 | green | 0.21207386707694031
1355350 | green | 0.2121940467579876
505036 | green | 0.21220934429072225
1570702 | green | 0.21469847813732257
1997836 | green | 0.21482420378988654
1195270 | green | 0.21613844835346685
634417 | green | 0.2172001587963871
1995160 | green | 0.21794015870874028
(11 rows)
I used the default hnsw.iterative_scan, which is set to off. Enabling iterative scans returns more results by continuing the scan until there are enough candidates to satisfy limit 15.
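As a mitigation sketch (assuming pgvector 0.8.0, where both parameters are available), you can widen the candidate list or enable iterative scans for the session:
set hnsw.ef_search to 200;                 -- consider more candidates per index scan (default 40)
set hnsw.iterative_scan to relaxed_order;  -- keep scanning until enough rows pass the filter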
Compare ENN and ANN
The nearest green neighbor found by the index has a cosine similarity of 0.2076671534464677, but the exact search identified fifteen closer green points. This discrepancy highlights a characteristic of Hierarchical Navigable Small Worlds (HNSW). The index is constructed with multiple layers, beginning with sparse upper layers that serve as entry points. These layers help restrict the search scope as it descends into the dense bottom layer (layer 0). While this layered design accelerates search times, it can also miss closer neighbors in layer 0 that were not explored when descending from the upper layers.
Since all points in my dataset were generated using random() values, it lacks natural clustering. This diminishes routing efficiency and increases the likelihood of missing closer neighbors during approximate searches. Therefore, I am illustrating the worst-case scenario.
I checked how many green neighbors the approximate search missed by disabling the pgvector index: the exact search finds 23 green rows closer than the first row returned by the approximate search, which appears as the 24th row in the following:
postgres=# set enable_indexscan to off;
SET
postgres=# select id , color, embedding <=> :'random_embedding' enn_cosine
from embeddings_table
where color='green'
and embedding <=> :'random_embedding' <= 0.2076671534464677
order by enn_cosine
;
id | color | enn_cosine
---------+-------+---------------------
1428352 | green | 0.19814620075833056
328933 | green | 0.2024464516951111
1261723 | green | 0.2031157228085848
1836052 | green | 0.20319815669479213
1536328 | green | 0.20353639191885098
1221802 | green | 0.20355073458778694
1328614 | green | 0.20373734017866685
327802 | green | 0.20464172025637872
238738 | green | 0.2048113256211399
969943 | green | 0.20566046923407266
1924495 | green | 0.2059847615560182
486043 | green | 0.20615577737388402
1911601 | green | 0.20652312839386933
1777339 | green | 0.20658742123960594
875029 | green | 0.20664456413189736
593119 | green | 0.2066683273490607
1354432 | green | 0.20685417261064953
1898782 | green | 0.20697419915308368
1429552 | green | 0.20704169544999784
1293397 | green | 0.20746811422822542
1371502 | green | 0.20746937923342468
998884 | green | 0.2074836628885286
845659 | green | 0.20759016691317878
1875277 | green | 0.2076671534464677
(24 rows)
The measure for this approximation accuracy is called 'recall'. The definition from the MongoDB glossary is:
Recall measures the fraction of true nearest neighbors that were returned by an ANN search. This measure reflects how close the algorithm approximates the results of ENN search.
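Applied to the filtered query above: none of the 11 rows returned by the approximate search appear in the exact Top-15 green results (every exact result has a distance below 0.2067, while the closest approximate result is at 0.2077), so the recall of that single scan is 0/15 = 0.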
This approximation applies to all index searches, which are faster than full scans but may miss some closer neighbors. Post-filtering reduces recall even further, as some candidates are discarded, leading to the possibility of missing good matches. That's why it is better to use pre-filtering on large databases. Although pre-filtering is not available in PostgreSQL pgvector, we can analyze the data to understand its potential impact.
Post-filtering impact on recall
The problem is that pgvector cannot apply the filter inside the index scan. As a result, the scan selects 40 candidates, as defined by the hnsw.ef_search parameter, and filtering on additional columns, like "color", then reduces the result.
To explain this, I've run the query without the filter, showing the first 40 candidates of the three colors:
postgres=# set enable_indexscan to on;
SET
postgres=# select id , color, embedding <=> :'random_embedding' ann_cosine
from embeddings_table
-- where color='green'
order by ann_cosine limit 40 -- current_setting('hnsw.ef_search')
;
id | color | ann_cosine
---------+-------+---------------------
1360308 | red | 0.20296796169334463
1732580 | blue | 0.20459117837053364
1085082 | red | 0.20626595570441808
1875277 | green | 0.2076671534464677
656874 | red | 0.20930762441475093
504242 | blue | 0.20984078446453025
247001 | blue | 0.20995935279258404
222817 | green | 0.21016644773554916
705833 | blue | 0.2102792157006329
1966310 | blue | 0.21028852384517327
1503321 | red | 0.21044019511424406
480572 | blue | 0.21100294080666748
1375990 | green | 0.2118530398002575
21538 | green | 0.21207386707694031
1355350 | green | 0.2121940467579876
505036 | green | 0.21220934429072225
106983 | red | 0.21293893184210688
238458 | red | 0.21295064471740388
1008380 | blue | 0.21301481665902566
298931 | blue | 0.21304336639331967
1147263 | red | 0.21342607115241874
562604 | blue | 0.2135493812411281
1570702 | green | 0.21469847813732257
1997836 | green | 0.21482420378988654
1291274 | blue | 0.2159207514557735
1195270 | green | 0.21613844835346685
1035932 | blue | 0.21623180106532514
1010237 | blue | 0.2164365008134519
1256966 | blue | 0.21652825716033564
1748231 | blue | 0.21676377376711842
634417 | green | 0.2172001587963871
1685391 | red | 0.21723542532805584
1964021 | blue | 0.21723845625858207
1056446 | blue | 0.21757530726298147
958670 | blue | 0.21769898462687431
1558046 | blue | 0.2177076235462454
516734 | blue | 0.21777311307937175
1995160 | green | 0.21794015870874028
228096 | red | 0.21866579506700412
660161 | blue | 0.2187276449697918
(40 rows)
This is what the indexed query did as its first step. If you keep only the green rows, you get 11 rows. Had the filter been applied first, the index scan would have returned 40 green candidates, and the query could have returned the Top-15 from them. Post-filtering misses good green candidates because candidates of other colors filled the 40 slots.
It would have been better to pre-filter during the approximate nearest neighbor search to consider only 'green' neighbors, rather than discarding non-matching ones afterward based on a post-filter on 'color'. However, pgvector does not support such pre-filtering. The consequence is a low recall in a single index scan.
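For comparison, here is a rough sketch of what a pre-filtered search looks like in MongoDB Atlas Vector Search. The collection, index name, field names, and query vector are illustrative, and "color" must be declared as a filter field in the search index definition; the filter is applied while traversing the index, so every returned candidate is green:
db.embeddings.aggregate([
  { $vectorSearch: {
      index: "vector_index",                // Atlas Vector Search index covering the "embedding" field
      path: "embedding",
      queryVector: referenceVector,         // illustrative: the 512-dimensional reference point as an array
      filter: { color: { $eq: "green" } },  // pre-filter: only green candidates are considered
      numCandidates: 200,
      limit: 15
  } }
]);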
Without filter - ENN and ANN
For queries without a filter, recall is generally better because, although approximate searches may miss some points, all candidates returned by the index are included in the results. Here are the Top 15 most similar points across all colors:
postgres=# set enable_indexscan to off;
SET
postgres=# select id , color, embedding <=> :'random_embedding' enn_cosine
from embeddings_table
--where color='green'
order by enn_cosine limit 15;
id | color | enn_cosine
---------+-------+---------------------
1506704 | blue | 0.1943345774574703
1428352 | green | 0.19814620075833056
905583 | red | 0.1986930398354949
1887345 | red | 0.19958922153843262
1408551 | red | 0.20155542317891084
1761962 | blue | 0.20168765608150285
91001 | blue | 0.20206633541960917
328933 | green | 0.2024464516951111
493388 | blue | 0.20277316748365937
1360308 | red | 0.20296796169334463
1261723 | green | 0.2031157228085848
1836052 | green | 0.20319815669479213
816615 | red | 0.20350817237259144
1536328 | green | 0.20353639191885098
1221802 | green | 0.20355073458778694
(15 rows)
Here is the same using an index:
postgres=# set enable_indexscan to on;
SET
postgres=# select id , color, embedding <=> :'random_embedding' ann_cosine
from embeddings_table
--where color='green'
order by ann_cosine limit 15;
id | color | ann_cosine
---------+-------+---------------------
1360308 | red | 0.20296796169334463
1732580 | blue | 0.20459117837053364
1085082 | red | 0.20626595570441808
1875277 | green | 0.2076671534464677
by Franck Pachot