A curated list of database news from authoritative sources

June 30, 2025

June 28, 2025

Flush to disk on commit 👉🏻 MongoDB durable writes

A Twitter (𝕏) thread was filled with misconceptions about MongoDB, spreading fear, uncertainty, and doubt (FUD). This led one user to question whether MongoDB acknowledges writes before they are actually flushed to disk:

Doesn't MongoDB acknowledge writes before it's actually flushed them to disk?

MongoDB, like many databases, employs journaling—also known as write-ahead logging (WAL)—to ensure durability (the D in ACID) with high performance. This involves safely recording write operations in the journal and ensuring they are flushed to disk before the commit is acknowledged. Further details can be found in the documentation under Write Concern and Journaling.
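To make this concrete, the write concern can also be set per operation. Here is a minimal mongosh sketch (the document and values are only illustrative): j: true requests acknowledgement only after the write is flushed to the on-disk journal, and w: "majority" additionally waits for replication on a replica set.

// illustrative write with an explicit, journaled write concern
db.mycollection.insertOne(
 { _id: 2, num: 0 },
 { writeConcern: { w: "majority", j: true } }
);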

Here is how you can test it, in a lab, with Linux strace and GDB, to debunk the myths.

Start the lab

I created a local MongoDB server. I use a single-node local Atlas cluster here, but you can do the same with replicas:

atlas deployments setup atlas --type local --port 27017 --force

Start it if it was stopped, and connect with MongoDB Shell:

atlas deployments start atlas
mongosh

Trace the system calls with strace

In another terminal, I used strace to display the system calls (-e trace) for writes (pwrite64) and syncs (fdatasync), with the file names (-yy), from the MongoDB server process (-p $(pgrep -d, mongod)) and its threads (-f), with execution times and timestamps (-tT):

strace -tT -fp $(pgrep -d, mongod) -yye trace=pwrite64,fdatasync -qqs 0

Some writes and syncs happen in the background:

[pid 2625869] 08:26:13 fdatasync(11</data/db/WiredTiger.wt>) = 0 <0.000022>                                                                                    
[pid 2625869] 08:26:13 pwrite64(13</data/db/journal/WiredTigerLog.0000000010>, ""..., 384, 19072) = 384 <0.000024>                                             
[pid 2625869] 08:26:13 fdatasync(13</data/db/journal/WiredTigerLog.0000000010>) = 0 <0.002123>                                                                 
[pid 2625868] 08:26:13 pwrite64(13</data/db/journal/WiredTigerLog.0000000010>, ""..., 128, 19456) = 128 <0.000057>                                             
[pid 2625868] 08:26:13 fdatasync(13</data/db/journal/WiredTigerLog.0000000010>) = 0 <0.002192>                                                                 
[pid 2625868] 08:26:23 pwrite64(13</data/db/journal/WiredTigerLog.0000000010>, ""..., 384, 19584) = 384 <0.000057>                                             
[pid 2625868] 08:26:23 fdatasync(13</data/db/journal/WiredTigerLog.0000000010>) = 0 <0.002068>                                                                 
[pid 2625868] 08:26:33 pwrite64(13</data/db/journal/WiredTigerLog.0000000010>, ""..., 384, 19968) = 384 <0.000061>                                             
[pid 2625868] 08:26:33 fdatasync(13</data/db/journal/WiredTigerLog.0000000010>) = 0 <0.002747>                                                                 
[pid 2625868] 08:26:43 pwrite64(13</data/db/journal/WiredTigerLog.0000000010>, ""..., 384, 20352) = 384 <0.000065>                                             
[pid 2625868] 08:26:43 fdatasync(13</data/db/journal/WiredTigerLog.0000000010>) = 0 <0.003008>                                                                 
[pid 2625868] 08:26:53 pwrite64(13</data/db/journal/WiredTigerLog.0000000010>, ""..., 384, 20736) = 384 <0.000075>                                             
[pid 2625868] 08:26:53 fdatasync(13</data/db/journal/WiredTigerLog.0000000010>) = 0 <0.002092>                                                                 
[pid 2625868] 08:27:03 pwrite64(13</data/db/journal/WiredTigerLog.0000000010>, ""..., 384, 21120) = 384 <0.000061>                                             
[pid 2625868] 08:27:03 fdatasync(13</data/db/journal/WiredTigerLog.0000000010>) = 0 <0.002527>                                                                 
[pid 2625869] 08:27:13 fdatasync(13</data/db/journal/WiredTigerLog.0000000010>) = 0 <0.000033>                                                                 

Write to the collection

In the MongoDB shell, I created a collection and ran ten updates:

db.mycollection.drop();
db.mycollection.insert( { _id: 1, num:0 });

for (let i = 1; i <= 10; i++) {
 db.mycollection.updateOne( { _id: 1 }, { $inc: { num: 1 } });
 print(` ${i} ${new Date()}`)
}

strace output the following while the loop of ten updates was running:

[pid 2625868] 08:33:07 pwrite64(13</data/db/journal/WiredTigerLog.0000000010>, ""..., 512, 76288) = 512 <0.000066>                                             
[pid 2625868] 08:33:07 fdatasync(13</data/db/journal/WiredTigerLog.0000000010>) = 0 <0.001865>                                                                 
[pid 2625868] 08:33:07 pwrite64(13</data/db/journal/WiredTigerLog.0000000010>, ""..., 512, 76800) = 512 <0.000072>                                             
[pid 2625868] 08:33:07 fdatasync(13</data/db/journal/WiredTigerLog.0000000010>) = 0 <0.001812>                                                                 
[pid 2625868] 08:33:07 pwrite64(13</data/db/journal/WiredTigerLog.0000000010>, ""..., 512, 77312) = 512 <0.000056>                                             
[pid 2625868] 08:33:07 fdatasync(13</data/db/journal/WiredTigerLog.0000000010>) = 0 <0.001641>                                                                 
[pid 2625868] 08:33:07 pwrite64(13</data/db/journal/WiredTigerLog.0000000010>, ""..., 512, 77824) = 512 <0.000043>                                             
[pid 2625868] 08:33:07 fdatasync(13</data/db/journal/WiredTigerLog.0000000010>) = 0 <0.001812>                                                                 
[pid 2625868] 08:33:07 pwrite64(13</data/db/journal/WiredTigerLog.0000000010>, ""..., 512, 78336) = 512 <0.000175>                                             
[pid 2625868] 08:33:07 fdatasync(13</data/db/journal/WiredTigerLog.0000000010>) = 0 <0.001944>                                                                 
[pid 2625868] 08:33:07 pwrite64(13</data/db/journal/WiredTigerLog.0000000010>, ""..., 512, 78848) = 512 <0.000043>                                             
[pid 2625868] 08:33:07 fdatasync(13</data/db/journal/WiredTigerLog.0000000010>) = 0 <0.001829>                                                                 
[pid 2625868] 08:33:07 pwrite64(13</data/db/journal/WiredTigerLog.0000000010>, ""..., 512, 79360) = 512 <0.000043>                                             
[pid 2625868] 08:33:07 fdatasync(13</data/db/journal/WiredTigerLog.0000000010>) = 0 <0.001917>                                                                 
[pid 2625868] 08:33:07 pwrite64(13</data/db/journal/WiredTigerLog.0000000010>, ""..., 512, 79872) = 512 <0.000050>                                             
[pid 2625868] 08:33:07 fdatasync(13</data/db/journal/WiredTigerLog.0000000010>) = 0 <0.002260>                                                                 
[pid 2625868] 08:33:07 pwrite64(13</data/db/journal/WiredTigerLog.0000000010>, ""..., 512, 80384) = 512 <0.000035>                                             
[pid 2625868] 08:33:07 fdatasync(13</data/db/journal/WiredTigerLog.0000000010>) = 0 <0.001940>                                                                 
[pid 2625868] 08:33:07 pwrite64(13</data/db/journal/WiredTigerLog.0000000010>, ""..., 512, 80896) = 512 <0.000054>                                             
[pid 2625868] 08:33:07 fdatasync(13</data/db/journal/WiredTigerLog.0000000010>) = 0 <0.001984>                                                                 

Each write (pwrite64) to the journal files was followed by a sync to disk (fdatasync). This system call is well documented:

FSYNC(2)                                                         Linux Programmer's Manual                                                         FSYNC(2)

NAME
       fsync, fdatasync - synchronize a file's in-core state with storage device

DESCRIPTION
       fsync() transfers ("flushes") all modified in-core data of (i.e., modified buffer cache pages for) the file referred to by the file descriptor fd to
       the disk device (or other permanent storage device) so that all changed information can be retrieved even if the  system  crashes  or  is  rebooted.
       This includes writing through or flushing a disk cache if present.  The call blocks until the device reports that the transfer has completed.
...
       fdatasync() is similar to fsync(), but does not flush modified metadata unless that metadata is needed in order to allow a subsequent data retrieval to  be  correctly  handled.   For  example,  changes  to  st_atime or st_mtime (respectively, time of last access and time of last modification
...
       The aim of fdatasync() is to reduce disk activity for applications that do not require all metadata to be synchronized with the disk.

Since I displayed both the commit times and the system call trace timestamps, you can see that they match. Here is the loop output corresponding to the traces above:

 1 Sat Jun 28 2025 08:33:07 GMT+0000 (Greenwich Mean Time)                                                                                                    
 2 Sat Jun 28 2025 08:33:07 GMT+0000 (Greenwich Mean Time)                                                                                                    
 3 Sat Jun 28 2025 08:33:07 GMT+0000 (Greenwich Mean Time)                                                                                                    
 4 Sat Jun 28 2025 08:33:07 GMT+0000 (Greenwich Mean Time)                                                                                                    
 5 Sat Jun 28 2025 08:33:07 GMT+0000 (Greenwich Mean Time)                                                                                                    
 6 Sat Jun 28 2025 08:33:07 GMT+0000 (Greenwich Mean Time)                                                                                                    
 7 Sat Jun 28 2025 08:33:07 GMT+0000 (Greenwich Mean Time)                                                                                                    
 8 Sat Jun 28 2025 08:33:07 GMT+0000 (Greenwich Mean Time)
 9 Sat Jun 28 2025 08:33:07 GMT+0000 (Greenwich Mean Time)
 10 Sat Jun 28 2025 08:33:07 GMT+0000 (Greenwich Mean Time)

Multi-document transactions

The previous example ran ten autocommit updates, each triggering a sync to disk.
In general, with good document data modeling, a document should match the business transaction. However, it is possible to use multi-document transactions, and they are ACID (atomic, consistent, isolated, and durable). Using multi-document transactions also reduces the sync latency, as the sync is required only once per transaction, at commit.

I ran the following five transactions, each performing one update and one insert:


const session = db.getMongo().startSession();
for (let i = 1; i <= 5; i++) {
 session.startTransaction();
  const sessionDb = session.getDatabase(db.getName());
  sessionDb.mycollection.updateOne( { _id: 1 }, { $inc: { num: 1 } });
  print(` ${i} updated ${new Date()}`)
  sessionDb.mycollection.insertOne( { answer:42 });
  print(` ${i} inserted ${new Date()}`)
 session.commitTransaction();
 print(` ${i} committed ${new Date()}`)
}

Strace still shows ten calls to pwrite64 and fdatasync. I used this multi-document transaction to go further and prove that the commit not only triggers a sync to disk, but also waits for its acknowledgement before returning success to the application.

Inject some latency with gdb

To show that the commit waits for the acknowledgment of fdatasync, I set a GDB breakpoint on the fdatasync call.

I stopped strace, and started GDB with a script that adds a latency of five seconds to fdatasync:

cat > gdb_slow_fdatasync.gdb <<GDB

break fdatasync
commands
  shell sleep 5
  continue
end
continue

GDB

gdb --batch -x gdb_slow_fdatasync.gdb -p $(pgrep mongod)

I ran the five transactions, each with its two writes. GDB shows when it hits the breakpoint:

Thread 31 "JournalFlusher" hit Breakpoint 1, 0x0000ffffa6096eec in fdatasync () from target:/lib64/libc.so.6 

My GDB script automatically waits five seconds and then continues the program, until the next call to fdatasync.

Here is the output from my loop of five transactions:

 1 updated Sat Jun 28 2025 08:49:32 GMT+0000 (Greenwich Mean Time)
 1 inserted Sat Jun 28 2025 08:49:32 GMT+0000 (Greenwich Mean Time)
 1 committed Sat Jun 28 2025 08:49:37 GMT+0000 (Greenwich Mean Time)
 2 updated Sat Jun 28 2025 08:49:37 GMT+0000 (Greenwich Mean Time)
 2 inserted Sat Jun 28 2025 08:49:37 GMT+0000 (Greenwich Mean Time)
 2 committed Sat Jun 28 2025 08:49:42 GMT+0000 (Greenwich Mean Time)
 3 updated Sat Jun 28 2025 08:49:42 GMT+0000 (Greenwich Mean Time)
 3 inserted Sat Jun 28 2025 08:49:42 GMT+0000 (Greenwich Mean Time)
 3 committed Sat Jun 28 2025 08:49:47 GMT+0000 (Greenwich Mean Time)
 4 updated Sat Jun 28 2025 08:49:47 GMT+0000 (Greenwich Mean Time)
 4 inserted Sat Jun 28 2025 08:49:47 GMT+0000 (Greenwich Mean Time)
 4 committed Sat Jun 28 2025 08:49:52 GMT+0000 (Greenwich Mean Time)
 5 updated Sat Jun 28 2025 08:49:52 GMT+0000 (Greenwich Mean Time)
 5 inserted Sat Jun 28 2025 08:49:52 GMT+0000 (Greenwich Mean Time)

The insert and update operations occur immediately, but the commit itself waits five seconds, because of the latency I injected with GDB. This demonstrates that the commit waits for fdatasync, guaranteeing the flush to persistent storage. For this demo, I used all default settings in MongoDB 8.0, but this behavior can still be tuned through write concern and journaling configurations.

I used GDB to examine the call stack. Alternatively, you can inject a delay with strace by adding this option: -e inject=fdatasync:delay_enter=5000000.
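For example, combining the earlier strace invocation with this injection option (a sketch; 5000000 microseconds is five seconds):

strace -tT -fp $(pgrep -d, mongod) -yye trace=pwrite64,fdatasync -qqs 0 \
 -e inject=fdatasync:delay_enter=5000000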

Look at the open source code

When calling fdatasync, errors can occur, and this may compromise durability if operations on the file descriptor continue (remember the PostgreSQL fsyncgate). MongoDB uses the open-source WiredTiger storage engine, which implemented the same solution as PostgreSQL to avoid that: panic instead of retry. You can review the os_fs.c code to verify this.

The fdatasync call is in the JournalFlusher thread and here is the backtrace:

#0  0x0000ffffa0b5ceec in fdatasync () from target:/lib64/libc.so.6
#1  0x0000aaaadf5312c0 in __posix_file_sync ()
#2  0x0000aaaadf4f53c8 in __log_fsync_file ()
#3  0x0000aaaadf4f58d4 in __wt_log_force_sync ()
#4  0x0000aaaadf4fb8b8 in __wt_log_flush ()
#5  0x0000aaaadf588348 in __session_log_flush ()
#6  0x0000aaaadf41b878 in mongo::WiredTigerSessionCache::waitUntilDurable(mongo::OperationContext*, mongo::WiredTigerSessionCache::Fsync, mongo::WiredTigerSessionCache::UseJournalListener) ()
#7  0x0000aaaadf412358 in mongo::WiredTigerRecoveryUnit::waitUntilDurable(mongo::OperationContext*) ()
#8  0x0000aaaadfbe855c in mongo::JournalFlusher::run() ()

The WiredTiger and MongoDB source repositories provide good entry points if you want to look at the code behind this.

Base your opinions on facts, not myths.

MongoDB began as a NoSQL database that prioritized availability and low latency over strong consistency. However, that was over ten years ago. As technology evolves, experts who refuse to constantly learn risk their knowledge becoming outdated, their skills diminishing, and their credibility suffering.
Today, MongoDB is a general-purpose database that supports transaction atomicity, consistency, isolation, and durability, whether the transaction involves a single document or multiple documents.

Next time you encounter claims from uninformed critics or detractors suggesting that MongoDB is not consistent or fails to flush committed changes to disk, you can confidently debunk these myths by referring to the official documentation and the open source code, and by conducting your own experiments. MongoDB is similar to PostgreSQL: buffered writes and WAL sync to disk on commit.

Want to meet people, try charging them for it?

I have been blogging consistently since 2017. And one of my goals in speaking publicly was always to connect with like-minded people. I always left my email and hoped people would get in touch. Even as my blog and Twitter became popular, passing 1M views and 20k followers, I basically never had people get in touch to chat or meet up.

So it felt kind of ridiculous when last November I started charging people $100 to chat. I mean, who am I? But people started showing up fairly immediately. Now granted the money did not go to me. It went to an education non-profit and I merely received the receipt.

And at this point I've met a number of interesting people, from VCs to business professors to undergraduate students to founders and everyone in between. People wanting to talk about trends in databases, about how to succeed as a programmer, about marketing for developers, and so on. Women and men throughout North America, Europe, Africa, New Zealand, India, Nepal, and so on. And I've raised nearly $6000 for educational non-profits.

How is it that you go from giving away your time for free and getting no hits to charging and almost immediately getting results? For one, every person responded very positively to it being a fundraiser. It also helps me be entirely shameless about sharing on social media every single time someone donates, because it's such a positive thing.

But also I think that in "charging" for my time it helps people feel more comfortable about actually taking my time, especially when we have never met. It gives you a reasonable excuse to take time from an internet rando.

On the other hand, a lot of people come for advice and I think giving advice is pretty dangerous, especially since my background is not super conventional. I try to always frame things as just sharing my opinion and my perspective and that they should talk with many others and not take my suggestions without consideration.

And there's also the problem that by charging everyone for my time now, I'm no longer available to people who could maybe use it the most. I do mention on my page that I will still take calls from people who don't donate, as my schedule allows. But to be honest I feel less incentivized to spend time when people do not donate. So I guess this is an issue with the program.

But I mitigated even this slightly, and significantly jump-started the program, during my 30th birthday when I took calls with any person who donated at least $30.

Anyway, I picked this path because I have wanted to get involved with helping students figure out their lives and careers. But without a degree I am literally unqualified for many volunteering programs. And I always found the time commitments for non-profits painful.

So until starting this I figured it wouldn't be until I retire that I find some way to make a difference. But ultimately I kept meeting people who were starting their own non-profits now or donated significantly to help students. Peer pressure. I wanted to do my part now. And 30 minutes of my time in return for a donation receipt has been an easy trade.

While only raising a humble $6,000 to date, the Chat for Education program has been more successful than I imagined. I've met many amazing people through it. And it's something that should be easy to keep up indefinitely.

I hope to meet you through it too!

June 27, 2025

Supercharging AWS database development with AWS MCP servers

Amazon Aurora, Amazon DynamoDB, and Amazon ElastiCache are popular choices for developers powering critical workloads, including global commerce platforms, financial systems, and real-time analytics applications. To enhance productivity, developers are supplementing everyday tasks with AI-assisted tools that understand context, suggest improvements, and help reason through system configurations. Model Context Protocol (MCP) is at the helm of this revolution, rapidly transforming how developers integrate AI assistants into their development pipelines. In this post, we explore the core concepts behind MCP and demonstrate how new AWS MCP servers can accelerate your database development through natural language prompts.

Managing PostgreSQL on Kubernetes with Percona Everest’s REST API

I’ve been working with Kubernetes and databases for the past few months, and I’m enjoying learning and exploring more about Percona Everest’s features. Percona Everest is a free, open source tool that makes it easier for teams to manage databases in the cloud. In a Cloud Native world, everything is programmable, including databases. Percona Everest […]

June 26, 2025

Percona XtraDB Cluster: Our Commitment to Open Source High Availability

At Percona, we’ve always been dedicated to providing robust, open source solutions that meet our users’ evolving needs. Percona XtraDB Cluster (PXC) stands as a testament to this commitment, offering a highly available and scalable solution for your MySQL and Percona Server for MySQL deployments. We understand that database high availability is critical for your […]

Scaling Smarter: What You Have Missed in MongoDB 8.0

MongoDB has always made it relatively easy to scale horizontally, but with version 8.0, the database takes a significant step forward. If you’re working with large datasets or high-throughput applications, some of the changes in this release will make your life a lot easier — and your architecture cleaner. Let’s take a look at some […]

June 25, 2025

Building a job search engine with PostgreSQL’s advanced search features

In today’s employment landscape, job search platforms play a crucial role in connecting employers with potential candidates. Behind these platforms lie complex search engines that must process and analyze vast amounts of structured and unstructured data to deliver relevant results. This post explores how to use PostgreSQL’s search features to build an effective job search engine. We examine each search capability in detail, discuss how they can be combined in PostgreSQL, and offer strategies for optimizing performance as your search engine scales.

Using Percona Everest Operator CRDs to Manage Databases in Kubernetes

Percona Everest is a free and open source tool for running and managing databases like PostgreSQL, MySQL, and MongoDB inside Kubernetes. It simplifies things by providing three ways to work with your databases: a web interface (UI), a set of commands (API), and direct access through Kubernetes itself using built-in tools like kubectl. > Note: […]

Build a Personalized AI Assistant with Postgres

Learn how to build a Supabase powered AI assistant that combines PostgreSQL with scheduling and external tools for long-term memory, structured data management and autonomous actions.

Use CedarDB to search the CedarDB docs and blogs

Motivation

Not so long ago, I shared that I have an interest in finding things and, in that case, the question was about where something could be found. Another common requirement is, given some expression of interest, finding the set of documents that best answers the question. For example, coupled with the geospatial question, we might include that we’re looking for Indian restaurants within the specified geographic area.

For this article, though, we’ll restrict the focus to the problem of finding the most relevant documents within some collection, where that collection just happens to be the CedarDB documentation. To that end, I’ll assert up front that my query “Does the CedarDB ‘asof join’ use an index?” should return a helpful response, while the query “Does pickled watermelon belong on a taco?” should ideally return an empty result.

June 24, 2025

No pre-filtering in pgvector means reduced ANN recall

AI applications are expanding rapidly, and PostgreSQL is a popular choice among relational databases. The pgvector extension, a third-party add-on, enhances PostgreSQL by introducing a high-dimensional vector data type with similarity operations and search indexing.
Integrating embeddings directly into a general-purpose database eliminates the need for a separate vector database. Typically, approximate searches on embeddings are performed alongside exact searches on various other attributes, SQL columns or document fields, such as metadata, dates, or other dimensions.
PostgreSQL offers various index types, but it has notable limitations when combining them, as we have seen in PostgreSQL JSONB Indexing Limitations with B-Tree and GIN. Likewise, pgvector encounters similar issues.
Some users have moved to MongoDB Atlas Vector Search because it offers pre-filtering capabilities. They had incomplete results with PostgreSQL pgvector when filtering with other predicates. To better understand the impact of lacking pre-filtering in such scenarios, I built this simple demo.

Setup PostgreSQL with pgvector

I started a pgvector container:

docker run --name pgv -d -e POSTGRES_PASSWORD=franck pgvector/pgvector:0.8.0-pg17
docker exec   -it pgv psql -U postgres

I enable the extension:

create extension if not exists vector;

Importing a synthetic dataset

I create a function to generate a random vector:

create function random_embedding(dimensions int) returns vector as
$$
select
    array(
        select random()::real
        from generate_series(1, dimensions)
    )::vector
$$ language sql;

I create a table to store embeddings ("embedding") with some metadata ("color"):

create table embeddings_table (
    id        bigserial primary key,
    color     text,
    embedding vector(512)
);

I inserted two million rows, each containing a randomly generated 512-dimensional vector, and assigned one of three colors as metadata:

insert into embeddings_table (embedding,color)
 select random_embedding(512)
  ,(array['red', 'green', 'blue'])[generate_series(1, 2000000)%3+1]
;

I used the random() function to ensure data is evenly distributed. The points are positioned in a 512-dimensional space, with one-third of the rows assigned to each color. This is a synthetic dataset, which makes it easier to understand the result.

I create a vector index, HNSW (Hierarchical Navigable Small Worlds), on the embeddings, using cosine similarity:

create index i1 on embeddings_table 
 using hnsw ( embedding vector_cosine_ops )
;
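I kept pgvector's default HNSW build parameters, which also influence recall. Spelling them out, the statement above is equivalent to the following (m = 16 and ef_construction = 64 are the documented defaults):

create index i1 on embeddings_table
 using hnsw ( embedding vector_cosine_ops )
 with (m = 16, ef_construction = 64)
;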

Query example

I generated one more random vector to use in my queries:

select random_embedding(512)
\gset

postgres=# select :'random_embedding';
           ?column?
---------------------------------------------------------------------
[0.899858,0.08105531,0.78641415,0.07696906,0.08429382,...,0.5175713,0.8292444]

The cosine similarity search finds the points in the table whose angle from the origin is closest to that of the reference point I stored in the :'random_embedding' variable.

I want to query the 15 points that are most similar to this reference point, but only consider the green category.

The following query filters on green rows (where color='green'), calculates the cosine similarity (embedding <=> :'random_embedding'), and keeps the 15 nearest points (order by nn_cosine limit 15):

select id , color, embedding <=> :'random_embedding' nn_cosine
 from embeddings_table
 where color='green'
 order by nn_cosine limit 15;

I used "nn_cosine" for the nearest neighbor cosine search. In future queries, I'll use "enn_cosine" or "ann_cosine" depending on whether I expect an exact or approximate result, from a full table scan or an index scan.

Embeddings have too many dimensions for us to visualize easily, but here's an analogy in our three-dimensional world. My dataset is like a soft ball pool with red, green, and blue balls, where each ball's position represents the meaning of the data. Cosine similarity search is akin to pointing a laser from the center of the pool to a reference point, which corresponds to the meaning we are looking for, and identifying balls whose positions form the smallest angles with the laser ray. Post-filtering searches all balls, then discards red and blue balls afterward. Pre-filtering considers only green balls when searching around the laser ray.

Exact Nearest Neighbors (ENN) with full scan

First, I disable the index to get an exact result:

postgres=# set enable_indexscan to off;
SET                                                                                                                                                                                                          

postgres=# select id , color, embedding <=> :'random_embedding' enn_cosine
 from embeddings_table
 where color='green'
 order by enn_cosine limit 15
;

   id    | color |     enn_cosine      
---------+-------+---------------------
 1428352 | green | 0.19814620075833056
  328933 | green |  0.2024464516951111
 1261723 | green |  0.2031157228085848
 1836052 | green | 0.20319815669479213
 1536328 | green | 0.20353639191885098
 1221802 | green | 0.20355073458778694
 1328614 | green | 0.20373734017866685
  327802 | green | 0.20464172025637872
  238738 | green |  0.2048113256211399
  969943 | green | 0.20566046923407266
 1924495 | green |  0.2059847615560182
  486043 | green | 0.20615577737388402
 1911601 | green | 0.20652312839386933
 1777339 | green | 0.20658742123960594
  875029 | green | 0.20664456413189736
(15 rows)

Without an index, this query is slow because it calculates the distance for each row that meets the 'color' predicate, sorts them by this distance, and retrieves the Top-15 results, but it has the advantage of providing an exact result:

postgres=# explain (analyze, buffers, costs off, summary on)                                                                                                                                                  
 select id , color, embedding <=> :'random_embedding' enn_cosine                                                                                                                                              
 from embeddings_table                                                                                                                                                                                       
 where color='green'                                                                                                                                                                                         
 order by enn_cosine limit 15
;                                                                                                                                                                               
                            QUERY PLAN
-------------------------------------------------------------------                                                                                                                 
 Limit (actual time=1868.024..1878.636 rows=15 loops=1)
   Buffers: shared hit=1989174 read=692354
   ->  Gather Merge (actual time=1868.022..1878.632 rows=15 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         Buffers: shared hit=1989174 read=692354
         ->  Sort (actual time=1853.062..1853.063 rows=13 loops=3)
               Sort Key: ((embedding <=> '[0.899858,0.08105531,0.78641415,0.07696906,0.08429382,...,0.5175713,0.8292444]'::vector))
               Sort Method: top-N heapsort  Memory: 26kB
               Buffers: shared hit=1989174 read=692354
               Worker 0:  Sort Method: top-N heapsort  Memory: 26kB
               Worker 1:  Sort Method: top-N heapsort  Memory: 26kB
               ->  Parallel Seq Scan on embeddings_table (actual time=0.126..1797.436 rows=222222 loops=3)
                     Filter: (color = 'green'::text)
                     Rows Removed by Filter: 444444
                     Buffers: shared hit=1989107 read=692347
 Planning:
   Buffers: shared read=1
 Planning Time: 0.124 ms
 Execution Time: 1878.658 ms

PostgreSQL utilized a parallel degree of 3. A filter on "color" reduced the number of rows to 222222 per process, resulting in 666666 rows. This filter eliminated 444444 in each worker, which accounts for two-thirds of the total rows. Each process calculated distances for its assigned rows and sorted them accordingly. Finally, the coordinator gathered the top 15 results from the worker processes.

Approximate Nearest Neighbors (ANN) with index

I enable the index to get a faster, but approximate, result:

postgres=# set enable_indexscan to on;                                                                                                                                                                      
SET                                                                                                                                               

postgres=# explain (analyze, buffers, costs off, summary on)                                                                                                                                                  
 select id , color, embedding <=> :'random_embedding' ann_cosine                                                                                                                                              
 from embeddings_table
 where color='green'
 order by ann_cosine limit 15
;                                                                                                                                                                               
                            QUERY PLAN
-------------------------------------------------------------------    
 Limit (actual time=5.605..5.916 rows=11 loops=1)
   Buffers: shared hit=84 read=1470
   ->  Index Scan using i1 on embeddings_table (actual time=5.604..5.912 rows=11 loops=1)
         Order By: (embedding <=> '[0.899858,0.08105531,0.78641415,0.07696906,0.08429382,...,0.5175713,0.8292444]'::vector)
         Filter: (color = 'green'::text)
         Rows Removed by Filter: 29
         Buffers: shared hit=84 read=1470
 Planning:
   Buffers: shared read=1
 Planning Time: 0.089 ms
 Execution Time: 5.934 ms

The index was used to retrieve rows in cosine similarity order relative to my reference point (Order By: (embedding <=> '[...]'::vector)), but the search was limited to 40 candidates (the default hnsw.ef_search). 29 rows were discarded by the metadata filter ((color = 'green'::text)), leaving 11 rows (rows=11). Because of this, there were not enough candidates for the expected result (limit 15), and I got fewer rows than expected:

postgres=# select id , color, embedding <=> :'random_embedding' ann_cosine
 from embeddings_table
 where color='green'
 order by ann_cosine limit 15
;          
   id    | color |     ann_cosine      
---------+-------+---------------------
 1875277 | green |  0.2076671534464677
  222817 | green | 0.21016644773554916
 1375990 | green |  0.2118530398002575
   21538 | green | 0.21207386707694031
 1355350 | green |  0.2121940467579876
  505036 | green | 0.21220934429072225
 1570702 | green | 0.21469847813732257
 1997836 | green | 0.21482420378988654
 1195270 | green | 0.21613844835346685
  634417 | green |  0.2172001587963871
 1995160 | green | 0.21794015870874028
(11 rows)

I used the default hnsw.iterative_scan, which is off. Enabling iterative scans returns more results by scanning the index again until there are enough candidates to satisfy limit 15.
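For example, both are session settings (the values here are illustrative, not recommendations; strict_order and relaxed_order are the iterative scan modes available in pgvector 0.8.0):

set hnsw.ef_search = 200;
set hnsw.iterative_scan = relaxed_order;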

Compare ENN and ANN

The nearest green neighbor found by the index has a cosine similarity of 0.2076671534464677, but the exact search identified fifteen closer green points. This discrepancy highlights a characteristic of Hierarchical Navigable Small Worlds (HNSW). The index is constructed with multiple layers, beginning with sparse upper layers that serve as entry points. These layers help restrict the search scope as it descends into the dense bottom layer (layer 0). While this layered design accelerates search times, it can also miss closer neighbors in layer 0 that were not explored when descending from the upper layers.
Since all points in my dataset were generated using random() values, it lacks natural clustering. This diminishes routing efficiency and increases the likelihood of missing closer neighbors during approximate searches. Therefore, I am illustrating the worst-case scenario.

I checked how many results the approximate search missed by disabling the pgvector index. The exact search found 23 green rows closer than the best row returned by the approximate search, which appears as the 24th row in the following:

postgres=# set enable_indexscan to off;                                                                                                                                                                      
SET                                                                                                                                               

postgres=# select id , color, embedding <=> :'random_embedding' enn_cosine                                                                                                                                              
 from embeddings_table
   where color='green' 
   and embedding <=> :'random_embedding' <= 0.2076671534464677
 order by enn_cosine
;   

   id    | color |     enn_cosine      
---------+-------+---------------------
 1428352 | green | 0.19814620075833056
  328933 | green |  0.2024464516951111
 1261723 | green |  0.2031157228085848
 1836052 | green | 0.20319815669479213
 1536328 | green | 0.20353639191885098
 1221802 | green | 0.20355073458778694
 1328614 | green | 0.20373734017866685
  327802 | green | 0.20464172025637872
  238738 | green |  0.2048113256211399
  969943 | green | 0.20566046923407266
 1924495 | green |  0.2059847615560182
  486043 | green | 0.20615577737388402
 1911601 | green | 0.20652312839386933
 1777339 | green | 0.20658742123960594
  875029 | green | 0.20664456413189736
  593119 | green |  0.2066683273490607
 1354432 | green | 0.20685417261064953
 1898782 | green | 0.20697419915308368
 1429552 | green | 0.20704169544999784
 1293397 | green | 0.20746811422822542
 1371502 | green | 0.20746937923342468
  998884 | green |  0.2074836628885286
  845659 | green | 0.20759016691317878
 1875277 | green |  0.2076671534464677
(24 rows)

The measure for this approximation accuracy is called 'recall'. The definition from the MongoDB glossary is:

Recall measures the fraction of true nearest neighbors that were returned by an ANN search. This measure reflects how close the algorithm approximates the results of ENN search.

This approximation applies to all index searches, which are faster than full scans but may miss some closer neighbors. Post-filtering reduces recall even further, as some candidates are discarded, leading to the possibility of missing good matches. In this demo, none of the 11 rows returned by the filtered approximate search appear in the exact Top-15, so the recall of that query is 0/15. That's why it is better to use pre-filtering on large databases. Although pre-filtering is not available in PostgreSQL pgvector, we can analyze the data to understand its potential impact.

Post-filtering impact on recall

The problem is that pgvector cannot apply the filter inside the index. As a result, it selects 40 candidates, as defined by the hnsw.ef_search parameter, and filtering on additional columns, like "color", reduces the result set.

To explain this, I ran the query without the filter, showing the first 40 candidates across the three colors:

postgres=# set enable_indexscan to on;                                                                                                                                                                      
SET
postgres=# select id , color, embedding <=> :'random_embedding' ann_cosine
 from embeddings_table
 -- where color='green'
 order by ann_cosine limit 40 -- current_setting('hnsw.ef_search')
;
   id    | color |     ann_cosine
---------+-------+---------------------
 1360308 | red   | 0.20296796169334463
 1732580 | blue  | 0.20459117837053364
 1085082 | red   | 0.20626595570441808
 1875277 | green |  0.2076671534464677
  656874 | red   | 0.20930762441475093
  504242 | blue  | 0.20984078446453025
  247001 | blue  | 0.20995935279258404
  222817 | green | 0.21016644773554916
  705833 | blue  |  0.2102792157006329
 1966310 | blue  | 0.21028852384517327
 1503321 | red   | 0.21044019511424406
  480572 | blue  | 0.21100294080666748
 1375990 | green |  0.2118530398002575
   21538 | green | 0.21207386707694031
 1355350 | green |  0.2121940467579876
  505036 | green | 0.21220934429072225
  106983 | red   | 0.21293893184210688
  238458 | red   | 0.21295064471740388
 1008380 | blue  | 0.21301481665902566
  298931 | blue  | 0.21304336639331967
 1147263 | red   | 0.21342607115241874
  562604 | blue  |  0.2135493812411281
 1570702 | green | 0.21469847813732257
 1997836 | green | 0.21482420378988654
 1291274 | blue  |  0.2159207514557735
 1195270 | green | 0.21613844835346685
 1035932 | blue  | 0.21623180106532514
 1010237 | blue  |  0.2164365008134519
 1256966 | blue  | 0.21652825716033564
 1748231 | blue  | 0.21676377376711842
  634417 | green |  0.2172001587963871
 1685391 | red   | 0.21723542532805584
 1964021 | blue  | 0.21723845625858207
 1056446 | blue  | 0.21757530726298147
  958670 | blue  | 0.21769898462687431
 1558046 | blue  |  0.2177076235462454
  516734 | blue  | 0.21777311307937175
 1995160 | green | 0.21794015870874028
  228096 | red   | 0.21866579506700412
  660161 | blue  |  0.2187276449697918
(40 rows)

This is what the index scan did as a first step. If you keep only the green rows, you get 11 rows. If the filter had been applied first, the index scan would have returned 40 green rows, and the query could have returned the Top-15 from them. Post-filtering misses some good green candidates because non-matching rows occupied candidate slots.

It would have been better to pre-filter during the approximate nearest neighbor search to consider only 'green' neighbors, rather than discarding non-matching ones afterward based on a post-filter on 'color'. However, pgvector does not support such pre-filtering. The consequence is a low recall in a single index scan.
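One possible workaround, not used in this demo, is a standard PostgreSQL partial index per filter value, so that the HNSW graph contains only matching rows; it is practical only when the filter has a small, known set of values:

create index embeddings_green_hnsw on embeddings_table
 using hnsw ( embedding vector_cosine_ops )
 where color = 'green'
;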

Without filter - ENN and ANN

For queries without a filter, recall is generally better because, although approximate searches may miss some points, all candidates returned by the index are included in the results. Here are the Top 15 most similar points across all colors:

postgres=# set enable_indexscan to off;                                                                                                                                                                      
SET    

postgres=# select id , color, embedding <=> :'random_embedding' enn_cosine
 from embeddings_table
 --where color='green'
 order by enn_cosine limit 15;

   id    | color |     enn_cosine      
---------+-------+---------------------
 1506704 | blue  |  0.1943345774574703
 1428352 | green | 0.19814620075833056
  905583 | red   |  0.1986930398354949
 1887345 | red   | 0.19958922153843262
 1408551 | red   | 0.20155542317891084
 1761962 | blue  | 0.20168765608150285
   91001 | blue  | 0.20206633541960917
  328933 | green |  0.2024464516951111
  493388 | blue  | 0.20277316748365937
 1360308 | red   | 0.20296796169334463
 1261723 | green |  0.2031157228085848
 1836052 | green | 0.20319815669479213
  816615 | red   | 0.20350817237259144
 1536328 | green | 0.20353639191885098
 1221802 | green | 0.20355073458778694
(15 rows)

Here is the same using an index:

postgres=# set enable_indexscan to on;                                                                                                                                                                      
SET    

postgres=# select id , color, embedding <=> :'random_embedding' ann_cosine
 from embeddings_table
 --where color='green'
 order by ann_cosine limit 15;

   id    | color |     ann_cosine      
---------+-------+---------------------
 1360308 | red   | 0.20296796169334463
 1732580 | blue  | 0.20459117837053364
 1085082 | red   | 0.20626595570441808
 1875277 | green |  0.2076671534464677
  
                                        by Franck Pachot
                                    

No pre-filtering in pgvector means reduced ANN recall

AI applications are expanding rapidly, and PostgreSQL is a popular choice among relational databases. The pgvector extension, a third-party add-on, enhances PostgreSQL by introducing a high-dimensional vector data type with similarity operations and search indexing.
Integrating embeddings directly into a general-purpose database eliminates the need for a separate vector store. Typically, approximate searches on embeddings are combined with exact searches on other attributes, SQL columns or document fields such as metadata, dates, or other dimensions.
PostgreSQL offers various index types, but it has notable limitations when combining them, as we have seen in PostgreSQL JSONB Indexing Limitations with B-Tree and GIN. Likewise, pgvector encounters similar issues.
Some users have moved to MongoDB Atlas Vector Search because it offers pre-filtering. They observed incomplete results with PostgreSQL pgvector when combining the vector search with other predicates. To better understand the impact of missing pre-filtering in such scenarios, I built this simple demo.

Setup PostgreSQL with pgvector

I started a pgvector container:

docker run --name pgv -d -e POSTGRES_PASSWORD=franck pgvector/pgvector:0.8.0-pg17
docker exec   -it pgv psql -U postgres

I enable the extension:

create extension if not exists vector;
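
As a quick sanity check that is not part of the original post, the pg_extension catalog shows which version of the extension is installed:

select extname, extversion from pg_extension where extname = 'vector';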

Importing a synthetic dataset

I create a function to generate a random vector:

create function random_embedding(dimensions int) returns vector as
$$
select
    array(
        select random()::real
        from generate_series(1, dimensions)
    )::vector
$$ language sql;
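
A small smoke test, my addition rather than the author's, generating a 3-dimensional vector just to check the output format before filling the real table:

select random_embedding(3);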

I create a table to store embeddings ("embedding") with some metadata ("color"):

create table embeddings_table (
    id        bigserial primary key,
    color     text,
    embedding vector(512)
);

I inserted two million rows, each containing a randomly generated 512-dimensional vector, and assigned one of three colors as metadata:

insert into embeddings_table (embedding,color)
 select random_embedding(512)
  ,(array['red', 'green', 'blue'])[generate_series(1, 2000000)%3+1]
;

I used the random() function so that the data is evenly distributed: the points are spread across the 512-dimensional space, and one-third of the rows are assigned to each color. This is a synthetic dataset, which makes the results easier to reason about.
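
As a quick verification of that distribution, not shown in the original post, a simple aggregate should report roughly 666,666 rows per color:

select color, count(*) from embeddings_table group by color order by color;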

I create a vector index, HNSW (Hierarchical Navigable Small Worlds), on the embeddings, using cosine similarity:

create index i1 on embeddings_table 
 using hnsw ( embedding vector_cosine_ops )
;
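
This uses pgvector's default HNSW build parameters. As an illustrative variant that I did not use in this demo, the m and ef_construction storage parameters can be set explicitly; the values below are simply the documented defaults, and i1_explicit is a hypothetical name:

create index i1_explicit on embeddings_table
 using hnsw ( embedding vector_cosine_ops )
 with (m = 16, ef_construction = 64);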

Query example

I generated one more random vector to use in my queries:

select random_embedding(512)
\gset

postgres=# select :'random_embedding';
           ?column?
---------------------------------------------------------------------
[0.899858,0.08105531,0.78641415,0.07696906,0.08429382,...,0.5175713,0.8292444]

The cosine similarity search finds the points in the table whose embedding forms the smallest angle with the reference vector that I've stored in the :'random_embedding' variable.

I want to query the 15 points that are most similar to this reference point, but only consider the green category.

The following query filters on green rows (where color='green'), computes the cosine distance (embedding <=> :'random_embedding'), and keeps the 15 nearest points (order by nn_cosine limit 15):

select id , color, embedding <=> :'random_embedding' nn_cosine
 from embeddings_table
 where color='green'
 order by nn_cosine limit 15;

I used "nn_cosine" for the nearest neighbor cosine search. In future queries, I'll use "enn_cosine" or "ann_cosine" depending on whether I expect an exact or approximate result, from a full table scan or an index scan.

Embeddings have too many dimensions for us to visualize easily, but here's an analogy in our three-dimensional world. My dataset is like a ball pit with red, green, and blue balls, where each ball's position represents the meaning of the data. Cosine similarity search is akin to pointing a laser from the center of the pit toward a reference point, the meaning we are looking for, and identifying the balls whose positions form the smallest angles with the laser ray. Post-filtering searches all balls, then discards the red and blue ones afterward. Pre-filtering considers only green balls when searching around the laser ray.

Exact Nearest Neighbors (ENN) with full scan

First, I disable the index to get an exact result:

postgres=# set enable_indexscan to off;
SET                                                                                                                                                                                                          

postgres=# select id , color, embedding <=> :'random_embedding' enn_cosine
 from embeddings_table
 where color='green'
 order by enn_cosine limit 15
;

   id    | color |     enn_cosine      
---------+-------+---------------------
 1428352 | green | 0.19814620075833056
  328933 | green |  0.2024464516951111
 1261723 | green |  0.2031157228085848
 1836052 | green | 0.20319815669479213
 1536328 | green | 0.20353639191885098
 1221802 | green | 0.20355073458778694
 1328614 | green | 0.20373734017866685
  327802 | green | 0.20464172025637872
  238738 | green |  0.2048113256211399
  969943 | green | 0.20566046923407266
 1924495 | green |  0.2059847615560182
  486043 | green | 0.20615577737388402
 1911601 | green | 0.20652312839386933
 1777339 | green | 0.20658742123960594
  875029 | green | 0.20664456413189736
(15 rows)

Without an index, this query is slow because it calculates the distance for each row that meets the 'color' predicate, sorts them by this distance, and retrieves the Top-15 results, but it has the advantage of providing an exact result:

postgres=# explain (analyze, buffers, costs off, summary on)                                                                                                                                                  
 select id , color, embedding <=> :'random_embedding' enn_cosine                                                                                                                                              
 from embeddings_table                                                                                                                                                                                       
 where color='green'                                                                                                                                                                                         
 order by enn_cosine limit 15
;                                                                                                                                                                               
                            QUERY PLAN
-------------------------------------------------------------------                                                                                                                 
 Limit (actual time=1868.024..1878.636 rows=15 loops=1)
   Buffers: shared hit=1989174 read=692354
   ->  Gather Merge (actual time=1868.022..1878.632 rows=15 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         Buffers: shared hit=1989174 read=692354
         ->  Sort (actual time=1853.062..1853.063 rows=13 loops=3)
               Sort Key: ((embedding <=> '[0.899858,0.08105531,0.78641415,0.07696906,0.08429382,...,0.5175713,0.8292444]'::vector))
               Sort Method: top-N heapsort  Memory: 26kB
               Buffers: shared hit=1989174 read=692354
               Worker 0:  Sort Method: top-N heapsort  Memory: 26kB
               Worker 1:  Sort Method: top-N heapsort  Memory: 26kB
               ->  Parallel Seq Scan on embeddings_table (actual time=0.126..1797.436 rows=222222 loops=3)
                     Filter: (color = 'green'::text)
                     Rows Removed by Filter: 444444
                     Buffers: shared hit=1989107 read=692347
 Planning:
   Buffers: shared read=1
 Planning Time: 0.124 ms
 Execution Time: 1878.658 ms

PostgreSQL used a parallel degree of 3 (the leader plus two workers). The filter on "color" kept 222222 rows per process, about 666666 in total, and removed 444444 rows per process, two-thirds of the table. Each process computed the distance for its rows and kept its own Top-15 with a top-N heapsort, and the leader merged those partial results to return the overall Top-15.
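
To compare with a serial execution, one could disable parallel workers for the session before re-running the same explain; this is a side experiment that is not part of the post:

set max_parallel_workers_per_gather to 0;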

Approximate Nearest Neighbors (ANN) with index

I enable the index to get a faster, but approximate, result:

postgres=# set enable_indexscan to on;                                                                                                                                                                      
SET                                                                                                                                               

postgres=# explain (analyze, buffers, costs off, summary on)                                                                                                                                                  
 select id , color, embedding <=> :'random_embedding' ann_cosine                                                                                                                                              
 from embeddings_table
 where color='green'
 order by ann_cosine limit 15
;                                                                                                                                                                               
                            QUERY PLAN
-------------------------------------------------------------------    
 Limit (actual time=5.605..5.916 rows=11 loops=1)
   Buffers: shared hit=84 read=1470
   ->  Index Scan using i1 on embeddings_table (actual time=5.604..5.912 rows=11 loops=1)
         Order By: (embedding <=> '[0.899858,0.08105531,0.78641415,0.07696906,0.08429382,...,0.5175713,0.8292444]'::vector)
         Filter: (color = 'green'::text)
         Rows Removed by Filter: 29
         Buffers: shared hit=84 read=1470
 Planning:
   Buffers: shared read=1
 Planning Time: 0.089 ms
 Execution Time: 5.934 ms

The index was used to retrieve rows ordered by cosine distance from my reference point (Order By: (embedding <=> '[...]'::vector)), but the search was limited to 40 candidates (the default hnsw.ef_search). 29 of them were discarded by the metadata filter ((color = 'green'::text)), leaving 11 rows (rows=11). Because of this, there were not enough candidates to satisfy the expected result (limit 15), and the query returns fewer rows than requested:

postgres=# select id , color, embedding <=> :'random_embedding' ann_cosine
 from embeddings_table
 where color='green'
 order by ann_cosine limit 15
;          
   id    | color |     ann_cosine      
---------+-------+---------------------
 1875277 | green |  0.2076671534464677
  222817 | green | 0.21016644773554916
 1375990 | green |  0.2118530398002575
   21538 | green | 0.21207386707694031
 1355350 | green |  0.2121940467579876
  505036 | green | 0.21220934429072225
 1570702 | green | 0.21469847813732257
 1997836 | green | 0.21482420378988654
 1195270 | green | 0.21613844835346685
  634417 | green |  0.2172001587963871
 1995160 | green | 0.21794015870874028
(11 rows)

I used the default hnsw.iterative_scan setting, which is off. Enabling iterative scans makes pgvector go back to the index for more candidates until there are enough rows to satisfy limit 15.
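
Both knobs are session settings. Here is a sketch of how to change them, with illustrative values rather than the ones used in this post:

-- widen the candidate list beyond the default of 40
set hnsw.ef_search to 200;
-- pgvector 0.8.0 and later: keep scanning until enough rows pass the filter
set hnsw.iterative_scan to relaxed_order;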

Compare ENN and ANN

The nearest green neighbor found by the index has a cosine similarity of 0.2076671534464677, but the exact search identified fifteen closer green points. This discrepancy highlights a characteristic of Hierarchical Navigable Small Worlds (HNSW). The index is constructed with multiple layers, beginning with sparse upper layers that serve as entry points. These layers help restrict the search scope as it descends into the dense bottom layer (layer 0). While this layered design accelerates search times, it can also miss closer neighbors in layer 0 that were not explored when descending from the upper layers.
Since all points in my dataset were generated using random() values, it lacks natural clustering. This diminishes routing efficiency and increases the likelihood of missing closer neighbors during approximate searches. Therefore, I am illustrating the worst-case scenario.

I checked how many green neighbors the approximate search missed by disabling the pgvector index. It turns out that 23 closer green points were skipped: the best green candidate found by the index scan is only the 24th row of the exact result below:

postgres=# set enable_indexscan to off;                                                                                                                                                                      
SET                                                                                                                                               

postgres=# select id , color, embedding <=> :'random_embedding' enn_cosine                                                                                                                                              
 from embeddings_table
   where color='green' 
   and embedding <=> :'random_embedding' <= 0.2076671534464677
 order by enn_cosine
;   

   id    | color |     enn_cosine      
---------+-------+---------------------
 1428352 | green | 0.19814620075833056
  328933 | green |  0.2024464516951111
 1261723 | green |  0.2031157228085848
 1836052 | green | 0.20319815669479213
 1536328 | green | 0.20353639191885098
 1221802 | green | 0.20355073458778694
 1328614 | green | 0.20373734017866685
  327802 | green | 0.20464172025637872
  238738 | green |  0.2048113256211399
  969943 | green | 0.20566046923407266
 1924495 | green |  0.2059847615560182
  486043 | green | 0.20615577737388402
 1911601 | green | 0.20652312839386933
 1777339 | green | 0.20658742123960594
  875029 | green | 0.20664456413189736
  593119 | green |  0.2066683273490607
 1354432 | green | 0.20685417261064953
 1898782 | green | 0.20697419915308368
 1429552 | green | 0.20704169544999784
 1293397 | green | 0.20746811422822542
 1371502 | green | 0.20746937923342468
  998884 | green |  0.2074836628885286
  845659 | green | 0.20759016691317878
 1875277 | green |  0.2076671534464677
(24 rows)

The measure of this approximation's accuracy is called 'recall'. The definition from the MongoDB glossary is:

Recall measures the fraction of true nearest neighbors that were returned by an ANN search. This measure reflects how close the algorithm approximates the results of ENN search.

This approximation applies to all index searches, which are faster than full scans but may miss some closer neighbors. Post-filtering reduces recall even further, as some candidates are discarded, leading to the possibility of missing good matches. That's why it is better to use pre-filtering on large databases. Although pre-filtering is not available in PostgreSQL pgvector, we can analyze the data to understand its potential impact.
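
To put a number on it for this query, here is a sketch of a recall calculation, my addition rather than part of the original post: capture the exact Top-15 ids first, then count how many of them the approximate scan returns.

set enable_indexscan to off;
create temporary table enn_top15 as
 select id from embeddings_table
 where color = 'green'
 order by embedding <=> :'random_embedding' limit 15;
set enable_indexscan to on;
select count(*) / 15.0 as recall
 from (
  select id from embeddings_table
  where color = 'green'
  order by embedding <=> :'random_embedding' limit 15
 ) ann
 join enn_top15 using (id);

With the rows shown above, none of the eleven approximate results appear in the exact Top-15, so the recall of this filtered query is 0.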

Post-filtering impact on recall

The problem is that pgvector cannot apply the filter inside the index. As a result, the scan selects only 40 candidates, as defined by the hnsw.ef_search parameter, and filtering on additional columns, like "color", is applied afterwards and shrinks the result.

To explain this, I've run the query without the filter, showing the first 40 candidates of the three colors:

postgres=# set enable_indexscan to on;                                                                                                                                                                      
SET
postgres=# select id , color, embedding <=> :'random_embedding' ann_cosine
 from embeddings_table
 -- where color='green'
 order by ann_cosine limit 40 -- current_setting('hnsw.ef_search')
;
   id    | color |     ann_cosine
---------+-------+---------------------
 1360308 | red   | 0.20296796169334463
 1732580 | blue  | 0.20459117837053364
 1085082 | red   | 0.20626595570441808
 1875277 | green |  0.2076671534464677
  656874 | red   | 0.20930762441475093
  504242 | blue  | 0.20984078446453025
  247001 | blue  | 0.20995935279258404
  222817 | green | 0.21016644773554916
  705833 | blue  |  0.2102792157006329
 1966310 | blue  | 0.21028852384517327
 1503321 | red   | 0.21044019511424406
  480572 | blue  | 0.21100294080666748
 1375990 | green |  0.2118530398002575
   21538 | green | 0.21207386707694031
 1355350 | green |  0.2121940467579876
  505036 | green | 0.21220934429072225
  106983 | red   | 0.21293893184210688
  238458 | red   | 0.21295064471740388
 1008380 | blue  | 0.21301481665902566
  298931 | blue  | 0.21304336639331967
 1147263 | red   | 0.21342607115241874
  562604 | blue  |  0.2135493812411281
 1570702 | green | 0.21469847813732257
 1997836 | green | 0.21482420378988654
 1291274 | blue  |  0.2159207514557735
 1195270 | green | 0.21613844835346685
 1035932 | blue  | 0.21623180106532514
 1010237 | blue  |  0.2164365008134519
 1256966 | blue  | 0.21652825716033564
 1748231 | blue  | 0.21676377376711842
  634417 | green |  0.2172001587963871
 1685391 | red   | 0.21723542532805584
 1964021 | blue  | 0.21723845625858207
 1056446 | blue  | 0.21757530726298147
  958670 | blue  | 0.21769898462687431
 1558046 | blue  |  0.2177076235462454
  516734 | blue  | 0.21777311307937175
 1995160 | green | 0.21794015870874028
  228096 | red   | 0.21866579506700412
  660161 | blue  |  0.2187276449697918
(40 rows)

This is what the indexed query did in its first step. If you keep only the green rows, you get 11 of them. Had the filter been applied during the scan, the index would have returned 40 green candidates and the query could have picked its Top-15 from them. Post-filtering misses good green candidates because non-green ones occupied the candidate slots.

It would have been better to pre-filter during the approximate nearest neighbor search to consider only 'green' neighbors, rather than discarding non-matching ones afterward based on a post-filter on 'color'. However, pgvector does not support such pre-filtering. The consequence is a low recall in a single index scan.
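
One PostgreSQL-native mitigation, which this demo does not use but which the pgvector documentation suggests when the filter values are known in advance, is a partial HNSW index per color, so that the graph itself contains only green vectors:

create index i_green on embeddings_table
 using hnsw ( embedding vector_cosine_ops )
 where color = 'green';

The planner can only use it when the query repeats the exact where color='green' predicate, and it does not help with ad-hoc or high-cardinality filters, which is precisely where real pre-filtering is needed.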

Without filter - ENN and ANN

For queries without a filter, recall is generally better because, although approximate searches may miss some points, all candidates returned by the index are included in the results. Here are the Top 15 most similar points across all colors:

postgres=# set enable_indexscan to off;                                                                                                                                                                      
SET    

postgres=# select id , color, embedding <=> :'random_embedding' enn_cosine
 from embeddings_table
 --where color='green'
 order by enn_cosine limit 15;

   id    | color |     enn_cosine      
---------+-------+---------------------
 1506704 | blue  |  0.1943345774574703
 1428352 | green | 0.19814620075833056
  905583 | red   |  0.1986930398354949
 1887345 | red   | 0.19958922153843262
 1408551 | red   | 0.20155542317891084
 1761962 | blue  | 0.20168765608150285
   91001 | blue  | 0.20206633541960917
  328933 | green |  0.2024464516951111
  493388 | blue  | 0.20277316748365937
 1360308 | red   | 0.20296796169334463
 1261723 | green |  0.2031157228085848
 1836052 | green | 0.20319815669479213
  816615 | red   | 0.20350817237259144
 1536328 | green | 0.20353639191885098
 1221802 | green | 0.20355073458778694
(15 rows)

Here is the same using an index:

postgres=# set enable_indexscan to on;                                                                                                                                                                      
SET    

postgres=# select id , color, embedding <=> :'random_embedding' ann_cosine
 from embeddings_table
 --where color='green'
 order by ann_cosine limit 15;

   id    | color |     ann_cosine      
---------+-------+---------------------
 1360308 | red   | 0.20296796169334463
 1732580 | blue  | 0.20459117837053364
 1085082 | red   | 0.20626595570441808
 1875277 | green |  0.2076671534464677
  
by Franck Pachot

Open Wins Again II: From Fork to Force

Forks aren’t just technical events. We can look at them as philosophical ruptures, and the open source world isn’t the first to face this. Crypto communities have lived and breathed this reality for over a decade, often in the most chaotic, transparent, and high-stakes ways possible. Now, as Valkey rises to become the de […]

June 23, 2025

Migrate a self-managed MySQL database to Amazon Aurora MySQL using AWS DMS homogeneous data migrations

In this post, we provide a comprehensive, step-by-step guide for migrating an on-premises self-managed encrypted MySQL database to Amazon Aurora MySQL using AWS DMS homogeneous data migrations over a private network. We show a complete end-to-end example of setting up and executing an AWS DMS homogeneous migration, consolidating all necessary configuration steps and best practices.