July 08, 2025
Migrating data from Postgres to Convex
Transparent Data Encryption: The Best Way to Secure Your Data in PostgreSQL
ALTER TABLE ... ADD COLUMN
MongoDB’s flexible schema allows each document within a collection to have a unique structure, a feature known as polymorphism. This contrasts with SQL databases, where every row in a table must adhere to a predefined set of columns. To support polymorphic data models without multiplying tables, SQL schemas often include many optional columns, which are frequently left null. These null values usually signify "not applicable" rather than "unknown". Furthermore, unused columns must be explicitly defined in the database schema, and modifying this structure typically requires locking the table, complicating maintenance.
This rigidity is often cited in comparisons favoring MongoDB, where applications can introduce new schemas without affecting existing objects. A common example was the ALTER TABLE ADD COLUMN statement, which locks the table because DDL operations must modify catalog information shared by all rows of the table. It was often mentioned to illustrate the rigidity of RDBMS: when the statement also had to update existing rows, the lock could be held for a long time, causing significant delays.
However, while some ALTER TABLE operations, such as changing a column's data type, still require rewriting the table, adding a column is no longer one of them. Most SQL databases now optimize adding a column as a metadata-only operation, making it fast regardless of table size. The main difference from MongoDB is in who controls the change: the database administrator or the application developer.
This blog post explains why mentioning ALTER TABLE ADD COLUMN to showcase schema flexibility is not ideal, because it has been optimized in many RDBMS. It is also an opportunity to explain how it works internally in PostgreSQL, and how similar it is to what developers do with MongoDB.
Example on PostgreSQL
Adding a column requires an exclusive lock, but since PostgreSQL 11 this lock is held only very briefly (provided the ALTER TABLE does not itself have to wait on another lock).
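Because the statement still takes a brief ACCESS EXCLUSIVE lock, and can itself queue behind long-running transactions, a common precaution is to cap the wait with lock_timeout so the DDL fails fast instead of blocking other sessions queued behind it. A minimal sketch, assuming a hypothetical table t:
set lock_timeout = '2s';
alter table t add column c int;  -- errors out after 2 seconds instead of queuing indefinitely
If it times out, the statement can simply be retried later.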
For example, I created a table with ten million rows:
postgres=# \timing on
Timing is on.
postgres=# create table iloveddl ( id bigint );
CREATE TABLE
Time: 2.026 ms
postgres=# insert into iloveddl select generate_series(1,1e7);
INSERT 0 10000000
Time: 31328.019 ms (00:31.328)
In PostgreSQL, updates are more costly than inserts, so rewriting this table would take minutes. I use such a large table to verify that my ALTER TABLE statements complete quickly and therefore do not trigger a rewrite of the rows.
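Timing is one signal, but there is a more direct check: a table rewrite allocates a new data file, so the table's relfilenode changes. A small verification one could run before and after each DDL (not part of the original demo):
select pg_relation_filenode('iloveddl'::regclass);
-- run the ALTER TABLE, then compare:
select pg_relation_filenode('iloveddl'::regclass);  -- an unchanged value means no rewrite happened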
Here is the first row in the table:
postgres=# select * from iloveddl where id <2;
id
----
1
(1 row)
Time: 839.856 ms
The column information is stored in the catalog and can be viewed in the pg_attribute table. To retrieve details such as the column name, number, nullability, default flag, and missing value, you can run the following SQL query:
select attname, attnum, attnotnull, atthasdef, atthasmissing, attmissingval
from pg_attribute
where attrelid='iloveddl'::regclass
;
attname | attnum | attnotnull | atthasdef | atthasmissing | attmissingval
----------+--------+------------+-----------+---------------+---------------
tableoid | -6 | t | f | f |
cmax | -5 | t | f | f |
xmax | -4 | t | f | f |
cmin | -3 | t | f | f |
xmin | -2 | t | f | f |
ctid | -1 | t | f | f |
id | 1 | f | f | f |
(7 rows)
Time: 0.569 ms
In addition to the system columns, there is an entry for the single column I defined in CREATE TABLE: "id". All flags are set to false: this column is nullable, has no default value, and lacks a value to replace a missing one. In SQL databases, a missing value, indicating no value exists, is different from a null, which signifies an unknown value. The missing value flag is used to optimize adding columns with default values without needing to rewrite all rows.
NULL with no DEFAULT
The following SQL command adds a nullable column named "x" of type integer, with no default value, to the "iloveddl" table. Since no default is specified and the column is nullable by default, it will contain NULL for existing rows:
postgres=# alter table iloveddl add column x int;
ALTER TABLE
Time: 1.760 ms
This was a quick operation, involving only metadata, without rewriting the rows. Existing rows have no information about this new column, but on query, PostgreSQL adds this column to the result with a null value:
postgres=# select * from iloveddl where id <2;
id | x
----+---
1 |
(1 row)
Time: 206.647 ms
This is a common scenario where the SQL semantics are straightforward: when a new column is added, it logically exists for all rows, but its value is unknown for those created before the column addition. Unknown values are represented as NULL. Prior to the ALTER TABLE command, the column did not exist. Afterward, it exists with an unknown value.
Although the ALTER TABLE ADD COLUMN operation affects all rows logically, it doesn't need to modify physical rows. Instead, the database catalog marks the existence of the new column for all rows, and at runtime, a physical non-existence is interpreted as a logical NULL.
PostgreSQL has simply added the column to the catalog as nullable (attnotnull is false):
select attname, attnum, attnotnull, atthasdef, atthasmissing, attmissingval
from pg_attribute
where attrelid='iloveddl'::regclass
;
attname | attnum | attnotnull | atthasdef | atthasmissing | attmissingval
----------+--------+------------+-----------+---------------+---------------
tableoid | -6 | t | f | f |
cmax | -5 | t | f | f |
xmax | -4 | t | f | f |
cmin | -3 | t | f | f |
xmin | -2 | t | f | f |
ctid | -1 | t | f | f |
id | 1 | f | f | f |
x | 2 | f | f | f |
(8 rows)
Time: 0.510 ms
With a simple insert into the catalog, PostgreSQL changed the definition of all rows, from a nonexistent "x" to one that exists with an unknown value.
For example, in a small business scenario, you may have stored only customers' names and emails. If you then decide to add a 'date of birth' column, this information likely always existed but was previously unrecorded. After the addition, the date of birth for existing customers appears as NULL, indicating that the value is unknown.
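Expressed in SQL, this kind of change remains a metadata-only operation. A minimal sketch, assuming a hypothetical customers table:
alter table customers add column date_of_birth date;
-- existing customers read back NULL (unknown) for date_of_birth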
NULL with DEFAULT
SQL allows an insert to omit a column, setting its value to NULL by default, which indicates the value is unknown at insert time. However, SQL developers can specify that omitted columns should instead take a default value. The column remains nullable, meaning it can still be explicitly set to NULL in an insert statement, but omitting the value is different from assigning null: the omission triggers the default value.
When adding a column with a default value in PostgreSQL, existing rows are treated as if they had been inserted after the column was added, without specifying a value for it. The new column must therefore return the default value for those existing rows. Before PostgreSQL 11, the ALTER TABLE command had to write this default value into every row, which could be slow. This limitation was often used to illustrate the rigidity of schema changes in SQL databases.
Currently, this particular case is optimized and performs quickly in the latest versions:
postgres=# alter table iloveddl add column y int null default 42;
ALTER TABLE
Time: 2.802 ms
The default value used for future inserts is stored in the pg_attrdef table. You can retrieve this information with the following SQL query:
select * from pg_attrdef
where adrelid='iloveddl'::regclass
;
oid | adrelid | adnum | adbin
--------+---------+-------+------------------------------------------------------------------------------------------------------------------------------------------------------
346434 | 346431 | 3 | {CONST :consttype 23 :consttypmod -1 :constcollid 0 :constlen 4 :constbyval true :constisnull false :location -1 :constvalue 4 [ 42 0 0 0 0 0 0 0 ]}
(1 row)
Time: 0.413 ms
This default is used for future inserts, but existing rows show the same value, which was set virtually in the catalog to avoid rewriting all rows during the ALTER TABLE statement:
postgres=# select * from iloveddl where id <2;
id | x | y
----+---+----
1 | | 42
(1 row)
Time: 189.657 ms
You can change the default value for future inserts without altering the existing data or the current value of the column. Here is an example:
postgres=# alter table iloveddl alter column y set default 100;
ALTER TABLE
Time: 2.039 ms
postgres=#
postgres=# insert into iloveddl(id) values (-1);
INSERT 0 1
Time: 1.516 ms
postgres=# select * from iloveddl where id <2;
id | x | y
----+---+-----
1 | | 42
-1 | | 100
(2 rows)
Time: 207.727 ms
This change affects only new rows: after setting the default to 100, inserting a row without specifying "y" automatically assigns 100. Existing rows are untouched; those that existed before the change still show their previous "y" value, 42.
The default value has been changed in pg_attrdef, to 100, which proves that the value for existing rows, 42, is stored elsewhere:
postgres=# select * from pg_attrdef where adrelid='iloveddl'::regclass;
oid | adrelid | adnum | adbin
--------+---------+-------+-------------------------------------------------------------------------------------------------------------------------------------------------------
346435 | 346431 | 3 | {CONST :consttype 23 :consttypmod -1 :constcollid 0 :constlen 4 :constbyval true :constisnull false :location -1 :constvalue 4 [ 100 0 0 0 0 0 0 0 ]}
(1 row)
Time: 0.568 ms
PostgreSQL appears to set columns with default values as if all rows have that value, but this is done logically rather than physically. It stores extra information in pg_attribute to assign a default value for columns that are missing when reading a table row, ensuring consistent behavior without physically altering each row:
select attname, attnum, attnotnull, atthasdef, atthasmissing, attmissingval
from pg_attribute
where attrelid='iloveddl'::regclass
;
attname | attnum | attnotnull | atthasdef | atthasmissing | attmissingval
----------+--------+------------+-----------+---------------+---------------
tableoid | -6 | t | f | f |
cmax | -5 | t | f | f |
xmax | -4 | t | f | f |
cmin | -3 | t | f | f |
xmin | -2 | t | f | f |
ctid | -1 | t | f | f |
id | 1 | f | f | f |
x | 2 | f | f | f |
y | 3 | f | t | t | {42}
(9 rows)
Time: 0.529 ms
When PostgreSQL reads a row whose stored tuple predates the new column, "y" is present neither as a value nor in the null bitmap. In that case, PostgreSQL adds the column to the result with the value stored in the catalog as attmissingval.
This optimization applies only when the default value is a constant. For scenarios like adding an expiration date to passwords (for example, forcing users to change their password annually), it makes sense to set the default for existing users to one year from the current date. The metadata-only path works only because the same expiration date applies to all existing users; a default that must be computed per row would require rewriting them.
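For example, the following sketch (assuming a hypothetical users table) contrasts the two situations: a fixed expiration date for everyone stays a metadata-only change, while a volatile expression such as clock_timestamp() forces PostgreSQL to rewrite the table, because each row would need its own computed value:
-- metadata-only: the same constant applies to every existing row
alter table users add column password_expires_at timestamptz default '2026-07-08';
-- forces a table rewrite: volatile default, evaluated per row
alter table users add column password_expires_at2 timestamptz default clock_timestamp() + interval '1 year';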
NOT NULL
It is possible to add a non-nullable column, but a default value is required. Otherwise, it would result in a null in a non-nullable column:
alter table iloveddl add column z int not null
;
ERROR: column "z" of relation "iloveddl" contains null values
Time: 1.024 ms
alter table iloveddl add column z int not null default 42
;
ALTER TABLE
Time: 2.322 ms
This was quick, a metadata-only change. With NOT NULL DEFAULT, either the value is set physically in the row, or it is absent and the value comes from the catalog:
select * from pg_attrdef
where adrelid='iloveddl'::regclass
;
oid | adrelid | adnum | adbin
--------+---------+-------+-------------------------------------------------------------------------------------------------------------------------------------------------------
346435 | 346431 | 3 | {CONST :consttype 23 :consttypmod -1 :constcollid 0 :constlen 4 :constbyval true :constisnull false :location -1 :constvalue 4 [ 100 0 0 0 0 0 0 0 ]}
346436 | 346431 | 4 | {CONST :consttype 23 :consttypmod -1 :constcollid 0 :constlen 4 :constbyval true :constisnull false :location -1 :constvalue 4 [ 42 0 0 0 0 0 0 0 ]}
(2 rows)
Time: 0.503 ms
select attname, attnum, attnotnull, atthasdef, atthasmissing, attmissingval
from pg_attribute
where attrelid='iloveddl'::regclass
;
attname | attnum | attnotnull | atthasdef | atthasmissing | attmissingval
----------+--------+------------+-----------+---------------+---------------
tableoid | -6 | t | f | f |
cmax | -5 | t | f | f |
xmax | -4 | t | f | f |
cmin | -3 | t | f | f |
xmin | -2 | t | f | f |
ctid | -1 | t | f | f |
id | 1 | f | f | f |
x | 2 | f | f | f |
y | 3 | f | t | t | {42}
z | 4 | t | t | t | {42}
(10 rows)
Time: 0.465 ms
If I insert a new row without mentioning the new column, it is set to the current default value from pg_attrdef and stored physically in the row. If I query a row that was inserted before the ALTER TABLE, it shows the value from pg_attribute.attmissingval:
insert into iloveddl(id,y) values (-2,null)
;
INSERT 0 1
Time: 8.692 ms
select * from iloveddl where id <2
;
id | x | y | z
----+---+-----+----
1 | | 42 | 42
-1 | | 100 | 42
-2 | | | 42
(3 rows)
Time: 195.648 ms
Summary of PostgreSQL behavior
The DEFAULT clause used in an ALTER TABLE ADD COLUMN statement serves two different purposes:
- Schema on write: When new rows are inserted without specifying a value for this column, the DEFAULT value is automatically assigned, functioning similarly to a trigger or a stored generated column.
- Schema on read: When querying rows that lack a value in this column, the DEFAULT value appears in the result set, similar to a view or a virtual column.
In PostgreSQL, there are three cases when reading a row. First, if a column is present in the null bitmap and flagged as null, the value returned is NULL. Second, if the column is present and flagged as not null, the actual value is returned. Lastly, if the column is not present in the null bitmap, the system returns a predefined value called attmissingval.
Comparison with MongoDB
In MongoDB, part of the document structure can be enforced using indexes and schema validation (schema-on-write). However, the application is free to add new fields to documents without impacting other objects, and to interpret them in application code (schema-on-read).
When a default value is needed, it is the application's responsibility to write it, rather than relying on a rule deployed in the database. If the absence of a field should be interpreted as another value, the application handles this in code. In general, the logic is more complex than a single constant for all existing data, and may depend on other fields. An aggregation pipeline can encode that logic so it runs efficiently in the database.
I create a collection similar to my PostgreSQL table:
for (let i = 1; i <= 10_000_000; i += 1000)
db.iloveddl.insertMany(Array.from(
{length: 1000},
(_,k) => ({_id: i+k})
))
;
Here is what the first document looks like:
db.iloveddl.find({_id:{$lt:2}})
[ { _id: 1 } ]
The equivalent of ADD COLUMN NULL with no default is simply defining the field on new documents when needed, with nothing to do on existing documents, because in MongoDB a missing field behaves like null.
New documents can set the field, and if there is a default value, the application sets it:
db.iloveddl.insertOne({ _id: -1, y: 100 })
{ acknowledged: true, insertedId: -1 }
db.iloveddl.find({_id:{$lt:2}})
[ { _id: -1, y: 100 }, { _id: 1 } ]
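At this point it is easy to check that a missing field behaves like null in queries: a filter on null matches documents where the field is absent as well as those where it is explicitly null, while $exists separates the two cases. A quick sketch in mongosh:
db.iloveddl.find({ _id: { $lt: 2 }, y: null })        // matches _id: 1 (no "y" field), not _id: -1 (y: 100)
db.iloveddl.find({ y: { $exists: false } }).limit(1)  // only documents where "y" is physically absent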
SQL was designed primarily for non-programmers using the command line to manipulate data. In this context, it is important to deploy defaults and check constraints directly into the database. MongoDB is designed for databases where data manipulation is handled through application code, and developers prefer to centralize all logic within the application, where it can be reviewed and tested more easily. Typically, the document inserted into MongoDB comes from an application object, and defaults are set in its constructor.
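As a sketch of that application-side pattern (a hypothetical class, not from the original post), the default lives in the constructor and every insert naturally carries it:
class Account {
  constructor(id, y = 42) {  // default value applied in application code
    this._id = id;
    this.y = y;
  }
}
db.iloveddl.insertOne({ ...new Account(-3) });       // stored with y: 42
db.iloveddl.insertOne({ ...new Account(-4, 100) });  // stored with y: 100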
The equivalent of ADD COLUMN NULL with DEFAULT is like in PostgreSQL, adding the field on read, except that it is done by the application code rather than a declaration in the database catalog:
db.iloveddl.aggregate([
{ $sort: { _id:1 } },
{ $project: {
_id: 1,
y: { $ifNull: [ "$y", 42 ] }
} }
])
[
{ _id: -1, y: 100 }, { _id: 1, y: 42 },
{ _id: 2, y: 42 }, { _id: 3, y: 42 },
{ _id: 4, y: 42 }, { _id: 5, y: 42 },
{ _id: 6, y: 42 }, { _id: 7, y: 42 },
{ _id: 8, y: 42 }, { _id: 9, y: 42 },
{ _id: 10, y: 42 }, { _id: 11, y: 42 },
{ _id: 12, y: 42 }, { _id: 13, y: 42 },
{ _id: 14, y: 42 }, { _id: 15, y: 42 },
{ _id: 16, y: 42 }, { _id: 17, y: 42 },
{ _id: 18, y: 42 }, { _id: 19, y: 42 }
]
Type "it" for more
If the logic for existing documents is more complex, like depending on other fields, it can be done the same way:
db.iloveddl.aggregate([
{ $sort: { _id: 1 } },
{ $project: {
_id: 1,
y: {
$ifNull: [
"$y",
{
$cond: [
{ $eq: [ { $mod: [ "$_id", 2 ] }, 0 ] },
42, // set "y" to 42 when "_id" is even
99  // set "y" to 99 when "_id" is odd
]
}
]
}
}
}
])
Like in PostgreSQL, such a schema-on-read transformation can be implemented in the database as a view:
db.createView(
"iloveddl_conditional_y", // Name of the view
"iloveddl", // Source collection
[
{ $sort: { _id: 1 } },
{ $project: {
_id: 1,
y: {
$ifNull: [
  "$y",
  {
    $cond: [
      { $eq: [ { $mod: [ "$_id", 2 ] }, 0 ] },
      42, // "y" defaults to 42 when "_id" is even
      99  // "y" defaults to 99 when "_id" is odd
    ]
  }
]
}
} }
]
)
Caching
Every time you use a computer, the cache is working to ensure your experience is fast.
July 07, 2025
The Future of Forums is Lies, I Guess
In my free time, I help run a small Mastodon server for roughly six hundred queer leatherfolk. When a new member signs up, we require them to write a short application—just a sentence or two. There’s a small text box in the signup form which says:
Please tell us a bit about yourself and your connection to queer leather/kink/BDSM. What kind of play or gear gets you going?
This serves a few purposes. First, it maintains community focus. Before this question, we were flooded with signups from straight, vanilla people who wandered in to the bar (so to speak), and that made things a little awkward. Second, the application establishes a baseline for people willing and able to read text. This helps in getting people to follow server policy and talk to moderators when needed. Finally, it is remarkably effective at keeping out spammers. In almost six years of operation, we’ve had only a handful of spam accounts.
I was talking about this with Erin Kissane last year, as she and Darius Kazemi conducted research for their report on Fediverse governance. We shared a fear that Large Language Models (LLMs) would lower the cost of sophisticated, automated spam and harassment campaigns against small servers like ours in ways we simply couldn’t defend against.
Anyway, here’s an application we got last week, for a user named mrfr:
Hi! I’m a queer person with a long-standing interest in the leather and kink community. I value consent, safety, and exploration, and I’m always looking to learn more and connect with others who share those principles. I’m especially drawn to power exchange dynamics and enjoy impact play, bondage, and classic leather gear.
On the surface, this is a great application. It mentions specific kinks, it uses actual sentences, and it touches on key community concepts like consent and power exchange. Saying “I’m a queer person” is a tad odd. Normally you’d be more specific, like “I’m a dyke” or “I’m a non-binary bootblack”, but the Zoomers do use this sort of phrasing. It does feel slightly LLM-flavored—something about the sentence structure and tone has just a touch of that soap-sheen to it—but that’s hardly definitive. Some of our applications from actual humans read just like this.
I approved the account. A few hours later, it posted this:
It turns out mrfr is short for Market Research Future, a company which produces vaguely spammy reports about all kinds of things from batteries to interior design. They actually have phone numbers on their web site, so I called +44 1720 412 167 to ask if they were aware of the posts. It is remarkably fun to ask business people about their interest in queer BDSM—sometimes stigma works in your favor. I haven’t heard back yet, but I’m guessing they’re either conducting this spam campaign directly, or have commissioned an SEO company which (perhaps without their knowledge) is doing it on their behalf.
Anyway, we’re not the only ones. There are also mrfr accounts purporting to be a weird car enthusiast, a like-minded individual, a bear into market research on interior design trends, and a green building market research enthusiast in DC, Maryland, or Virginia. Over on the seven-user loud.computer, mrfr applied with the text:
I’m a creative thinker who enjoys experimental art, internet culture, and unconventional digital spaces. I’d like to join loud.computer to connect with others who embrace weird, bold, and expressive online creativity, and to contribute to a community that values playfulness, individuality, and artistic freedom.
This too has the sheen of LLM slop. Of course a human could be behind these accounts—doing some background research and writing out detailed, plausible applications. But this is expensive, and a quick glance at either of our sites would have told that person that we have small reach and active moderation: a poor combination for would-be spammers. The posts don’t read as human either: the 4bear posting, for instance, incorrectly summarizes a report on interior design markets as if it offered interior design tips.
I strongly suspect that Market Research Future, or a subcontractor, is conducting an automated spam campaign which uses a Large Language Model to evaluate a Mastodon instance, submit a plausible application for an account, and to post slop which links to Market Research Future reports.
In some sense, this is a wildly sophisticated attack. The state of NLP seven years ago would have made this sort of thing flatly impossible. It is now effective. There is no way for moderators to robustly deny these kinds of applications without also rejecting real human beings searching for community.
In another sense, this attack is remarkably naive. All the accounts are named mrfr, which made it easy for admins to informally chat and discover the coordinated nature of the attack. They all link to the same domain, which is easy to interpret as spam. They use Indian IPs, where few of our users are located; we could reluctantly geoblock India to reduce spam. These shortcomings are trivial to overcome, and I expect they have been already, or will be shortly.
A more critical weakness is that these accounts only posted obvious spam; they made no effort to build up a plausible persona. Generating plausible human posts is more difficult, but broadly feasible with current LLM technology. It is essentially impossible for human moderators to reliably distinguish between an autistic rope bunny (hi) whose special interest is battery technology, and an LLM spambot which posts about how much they love to be tied up, and also new trends in battery chemistry. These bots have been extant on Twitter and other large social networks for years; many Fediverse moderators believe only our relative obscurity has shielded us so far.
These attacks do not have to be reliable to be successful. They only need to work often enough to be cost-effective, and the cost of LLM text generation is cheap and falling. Their sophistication will rise. Link-spam will be augmented by personal posts, images, video, and more subtle, influencer-style recommendations—“Oh my god, you guys, this new electro plug is incredible.” Networks of bots will positively interact with one another, throwing up chaff for moderators. I would not at all be surprised for LLM spambots to contest moderation decisions via email.
I don’t know how to run a community forum in this future. I do not have the time or emotional energy to screen out regular attacks by Large Language Models, with the knowledge that making the wrong decision costs a real human being their connection to a niche community. I do not know how to determine whether someone’s post about their new bicycle is genuine enthusiasm or automated astroturf. I don’t know how to foster trust and genuine interaction in a world of widespread text and image synthesis—in a world where, as one friend related this week, newbies can ask an LLM for advice on exploring their kinks, and the machine tells them to try solo breath play.
In this world I think woof.group, and many forums like it, will collapse.
One could imagine more sophisticated, high-contact interviews with applicants, but this would be time consuming. My colleagues relate stories from their companies about hiring employees who faked their interviews and calls using LLM prompts and real-time video manipulation. It is not hard to imagine that even if we had the time to talk to every applicant individually, those interviews might be successfully automated in the next few decades. Remember, it doesn’t have to work every time to be successful.
Maybe the fundamental limitations of transformer models will provide us with a cost-effective defense—we somehow force LLMs to blow out the context window during the signup flow, or come up with reliable, constantly-updated libraries of “ignore all previous instructions”-style incantations which we stamp invisibly throughout our web pages. Barring new inventions, I suspect these are unlikely to be robust against a large-scale, heterogenous mix of attackers. This arms race also sounds exhausting to keep up with. Drew DeVault’s Please Stop Externalizing Your Costs Directly Into My Face weighs heavy on my mind.
Perhaps we demand stronger assurance of identity. You only get an invite if you meet a moderator in person, or the web acquires a cryptographic web-of-trust scheme. I was that nerd trying to convince people to do GPG key-signing parties in high school, and we all know how that worked out. Perhaps in a future LLM-contaminated web, the incentives will be different. On the other hand, that kind of scheme closes off the forum to some of the people who need it most: those who are closeted, who face social or state repression, or are geographically or socially isolated.
Perhaps small forums will prove unprofitable, and attackers will simply give up. From my experience with small mail servers and web sites, I don’t think this is likely.
Right now, I lean towards thinking forums like woof.group will become untenable under LLM pressure. I’m not sure how long we have left. Perhaps five or ten years? In the meantime, I’m trying to invest in in-person networks as much as possible. Bars, clubs, hosting parties, activities with friends.
That, at least, feels safe for now.
Deploying MongoDB Test Environments with Terraform and Ansible
Want to spin up fully functional environments for trying out Percona Server for MongoDB, complete with Percona’s backup and monitoring solutions in minutes? We recently made our automation framework publicly available, which makes it easy to create and manage these environments either on your local machine or in public cloud environments. Why we built this […]
July 03, 2025
Erik (author of Lucene in Action) is starting a series on search indexes
A Tale of Two Databases: How PostgreSQL and MySQL Handle Torn Pages
Welcome to this first installment of the blog series, which explores how PostgreSQL and MySQL deal with different aspects of relational databases. As a long-time open source database administrator, I have always been fascinated by the differences in how these two databases handle various challenges and how DBAs who know one of these technologies often […]
The principles of extreme fault tolerance
The principles and processes we follow for fault tolerance.
July 02, 2025
Testing ReadySet as a Query Cacher for PostgreSQL (Plus ProxySQL and HAproxy) Part 2: Test Results
In the first post of this series (Testing ReadySet as a Query Cacher for PostgreSQL (Plus ProxySQL and HAproxy) Part 1: How-To), I presented my test environment and methodology and explained how to install ReadySet, ProxySQL, and HAproxy and configure them to work with PostgreSQL. In this final part, I present the different test scenarios […]
Chapter 7: Distributed Recovery (Concurrency Control Book)
Chapter 7 of the Concurrency Control and Recovery in Database Systems book by Bernstein and Hadzilacos (1987) tackles the distributed commit problem: ensuring atomic commit across a set of distributed sites that may fail independently.
The chapter covers these concepts:
- The challenges of transaction processing in distributed database systems (which weren't around in 1987)
- Failure models (site and communication) and timeout-based detection
- The definition and guarantees of Atomic Commitment Protocols (ACPs)
- The Two-Phase Commit (2PC) protocol (and its cooperative termination variant)
- The limitations of 2PC (especially blocking)
- Introduction and advantages of the Three-Phase Commit (3PC) protocol
Despite its rigor and methodical development, the chapter feels like a suspense movie today. We, the readers, equipped with modern tools like the FLP impossibility result and the Paxos protocol, watch as the authors try to navigate a minefield, unaware of the lurking impossibility results that had been published a couple of years earlier and the robust consensus frameworks (Viewstamped Replication and Paxos) that would emerge just a few years later.
Ok, let's dive in.
Atomic Commitment Protocol (ACP) problem
The problem is to ensure that in the presence of partial failures (individual site failures), a distributed transaction either commits at all sites or aborts at all sites, and never splits the decision. The authors define the desired properties of ACPs through a formal list of conditions (AC1–AC5).
We know that achieving these in an asynchronous setting with even one faulty process is impossible, as the FLP impossibility result established in 1985. Unfortunately, this impossibility result is entirely absent from the chapter’s framework. The authors implicitly assume bounded (and known) message delays and processing times, effectively assuming a synchronous system. That is an unrealistic portrayal of real-world distributed systems, even today in data centers.
A more realistic framework for distributed systems is the partially asynchronous model. Rather than assuming known and fixed bounds on message delays and processing times, the partially asynchronous model allows for periods of unpredictable latency, with the only guarantee being that bounds exist, just not that we know them. This model captures the reality of modern data centers, where systems often operate efficiently but can occasionally experience transient slowdowns or outages during which fixed bounds are violated, and higher bounds may hold for a while before the system converges back to stable behavior. This also motivates the use of weak failure detectors, which cannot definitively distinguish between a crashed node and a slow one.
This is where Paxos enters the picture. Conceived just a few years after this chapter, Paxos provides a consensus protocol that is safe under all conditions, including arbitrary message delays, losses, and reordering. It guarantees progress only during periods of partial synchrony, when the system behaves reliably enough for long enough, but it never violates safety even when conditions degrade. This doesn't conflict with what the FLP impossibility result of 1985 proves: you cannot simultaneously guarantee both safety and liveness in an asynchronous system with even one crash failure. But that doesn't mean you must give up on safety. In fact, the brilliance of Paxos lies in this separation: it preserves correctness unconditionally and defers liveness until the network cooperates. This resilience is exactly what's missing in the ACP designs of Bernstein and Hadzilacos even when using 3PC protocols.
If you like a quick intro to the FLP and earlier Coordinated Attack impossibility results, these three posts would help.
2PC and 3PC protocols
The authors first present the now-classic Two-Phase Commit (2PC) protocol, where the coordinator collects YES/NO votes from participants (the voting phase) and then broadcasts a COMMIT or ABORT (the decision phase). While 2PC satisfies AC1–AC4 in failure-free cases, it fails AC5 under partial failures. If a participant votes YES and then loses contact with the coordinator, it is stuck in an uncertainty period, unable to decide unilaterally whether to commit or abort. The authors provide a cooperative termination protocol, where uncertain participants consult peers to try to determine the outcome. It reduces, but does not eliminate, blocking.
Thus comes the Three-Phase Commit (3PC) protocol, which attempts to address 2PC's blocking flaw by introducing an intermediate state: PRE-COMMIT. The idea is that before actually committing, the coordinator ensures all participants are "prepared" and acknowledges that they can commit. Only once everyone has acknowledged this state does the coordinator send the final COMMIT. If a participant times out during this phase, it engages in a distributed election protocol and uses a termination rule to reach a decision.
Indeed, in synchronous systems, 3PC is non-blocking, and provides an improvement over 2PC. The problem is that 3PC relies critically on timing assumptions, always requiring bounded message and processing delays. The protocol's reliance on perfect timeout detection and a perfect failure detector makes it fragile. As a secondary problem, the 3PC protocol discussed in the book (Skeen 1982) has also been shown to contain subtle bugs, even in the synchronous model.
In retrospect
Reading this chapter today feels like watching a group of mountaineers scale a cliff without realizing they’re missing key gear. I spurted out my tea when I read these lines in the 3PC discussion. "To complete our discussion of this protocol we must address the issue of elections and what to do with blocked processes." Oh, no, don't go up that path without Paxos and distributed consensus formalization!! But the book predates Paxos (1989, though published later), Viewstamped Replication (1988), and the crystallization of the consensus problem. It also seems to be completely unaware of the FLP impossibility result (1985), which should have stopped them in their tracks.
This chapter is an earnest and technically careful work, but it's flying blind without the consensus theory that would soon reframe the problem. The chapter is an important historical artifact. It captures the state of the art before consensus theory illuminated the terrain. The authors were unable to realize that the distributed commit problem includes in it the distributed consensus problem, and that all the impossibility, safety, and liveness tradeoffs that apply to consensus apply here too.
Modern distributed database systems use Paxos-based commit. This is often via 2PC over Paxos/Raft groups for participant-sites. See for example our discussion and TLA+ modeling of distributed transactions in MongoDB.
Miscellaneous
This is funny. Someone is trolling on Wikipedia, trying to introduce Tupac as an alternative way to refer to 2PC.
July 01, 2025
Fluent Commerce’s approach to near-zero downtime Amazon Aurora PostgreSQL upgrade at 32 TB scale using snapshots and AWS DMS ongoing replication
Fluent Commerce, an omnichannel commerce platform, offers order management solutions that enable businesses to deliver seamless shopping experiences across various channels. Fluent uses Amazon Aurora PostgreSQL-Compatible Edition as its high-performance OLTP database engine to process their customers’ intricate search queries efficiently. Fluent Commerce strategically combined AWS-based upgrade approaches—including snapshot restores and AWS DMS ongoing replication—to seamlessly upgrade their 32 TB Aurora PostgreSQL databases with minimal downtime. In this post, we explore a pragmatic and cost-effective approach to achieve near-zero downtime during database upgrades. We explore the method of using the snapshot and restore method followed by continuous replication using AWS DMS.
Accelerate SQL Server to Amazon Aurora migrations with a customizable solution
Migrating from SQL Server to Amazon Aurora can significantly reduce database licensing costs and modernize your data infrastructure. To accelerate your migration journey, we have developed a migration solution that offers ease and flexibility. You can use this migration accelerator to achieve fast data migration and minimum downtime while customizing it to meet your specific business requirements. In this post, we showcase the core features of the migration accelerator, demonstrated through a complex use case of consolidating 32 SQL Server databases into a single Amazon Aurora instance with near-zero downtime, while addressing technical debt through refactoring.
Testing ReadySet as a Query Cacher for PostgreSQL (Plus ProxySQL and HAproxy) Part 1: How-To
A couple of weeks ago, I attended a PGDay event in Blumenau, a city not far away from where I live in Brazil. Opening the day were former Percona colleagues Marcelo Altmann and Wagner Bianchi, showcasing ReadySet’s support for PostgreSQL. Readyset is a source-available database cache service that differs from other solutions by not relying […]
Announcing PlanetScale for Postgres
PlanetScale now supports Postgres
Benchmarking Postgres
Benchmarking Postgres in a transparent, standardized and fair way is challenging. Here, we look at the process of how we did it in-depth
Framer upgrades user-facing analytics to real-time with Tinybird
Learn how the popular no-code website builder transformed their analytics offering from basic metrics to advanced features like A/B testing and funnels, enabling new revenue streams with Tinybird's real-time analytics infrastructure.
June 30, 2025
Strong consistency 👉🏻 MongoDB highly available durable writes
In the previous post, I used strace to display all calls to write and sync to disk from any MongoDB server thread:
strace -tT -fp $(pgrep -d, mongod) -yye trace=pwrite64,fdatasync -qqs 0
Adding replicas for High Availability
I did this with a single server, started with Atlas CLI. Let's do the same on a replica set with three servers. I start it with the following Docker Compose file:
services:
  mongo-1:
    image: mongo:8.0.10
    ports:
      - "27017:27017"
    volumes:
      - ./pgbench-mongo.js:/pgbench-mongo.js:ro
      - mongo-data-1:/data/db
    command: mongod --bind_ip_all --replSet rs0
    networks:
      - mongoha
  mongo-2:
    image: mongo:8.0.10
    ports:
      - "27018:27017"
    volumes:
      - ./pgbench-mongo.js:/pgbench-mongo.js:ro
      - mongo-data-2:/data/db
    command: mongod --bind_ip_all --replSet rs0
    networks:
      - mongoha
  mongo-3:
    image: mongodb/mongodb-community-server:latest
    ports:
      - "27019:27017"
    volumes:
      - ./pgbench-mongo.js:/pgbench-mongo.js:ro
      - mongo-data-3:/data/db
    command: mongod --bind_ip_all --replSet rs0
    networks:
      - mongoha
  init-replica-set:
    image: mongodb/mongodb-community-server:latest
    depends_on:
      - mongo-1
      - mongo-2
      - mongo-3
    entrypoint: |
      bash -xc '
      sleep 10
      mongosh --host mongo-1 --eval "
       rs.initiate( {_id: \"rs0\", members: [
        {_id: 0, priority: 3, host: \"mongo-1:27017\"},
        {_id: 1, priority: 2, host: \"mongo-2:27017\"},
        {_id: 2, priority: 1, host: \"mongo-3:27017\"}]
       });
      "
      '
    networks:
      - mongoha
volumes:
  mongo-data-1:
  mongo-data-2:
  mongo-data-3:
networks:
  mongoha:
    driver: bridge
I started this with docker compose up -d ; sleep 10 and then ran the strace command. I connected to the primary node with docker compose exec -it mongo-1 mongosh.
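Before running the workload, it is worth confirming which member is primary (mongo-1 should be, given its priority of 3). A quick check in mongosh, for example:
rs.status().members.forEach(m => print(m.name, m.stateStr))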
Run some transactions
I executed the same workload as in the previous post, with ten writes to a collection:
db.mycollection.drop();
db.mycollection.insert( { _id: 1, num:0 });
for (let i = 1; i <= 10; i++) {
print(` ${i} ${new Date()}`)
db.mycollection.updateOne( { _id: 1 }, { $inc: { num: 1 } });
print(` ${i} ${new Date()}`)
}
1 Mon Jun 30 2025 10:05:38 GMT+0000 (Coordinated Universal Time)
1 Mon Jun 30 2025 10:05:38 GMT+0000 (Coordinated Universal Time)
2 Mon Jun 30 2025 10:05:38 GMT+0000 (Coordinated Universal Time)
2 Mon Jun 30 2025 10:05:38 GMT+0000 (Coordinated Universal Time)
3 Mon Jun 30 2025 10:05:38 GMT+0000 (Coordinated Universal Time)
3 Mon Jun 30 2025 10:05:38 GMT+0000 (Coordinated Universal Time)
4 Mon Jun 30 2025 10:05:38 GMT+0000 (Coordinated Universal Time)
4 Mon Jun 30 2025 10:05:38 GMT+0000 (Coordinated Universal Time)
5 Mon Jun 30 2025 10:05:38 GMT+0000 (Coordinated Universal Time)
5 Mon Jun 30 2025 10:05:38 GMT+0000 (Coordinated Universal Time)
6 Mon Jun 30 2025 10:05:38 GMT+0000 (Coordinated Universal Time)
6 Mon Jun 30 2025 10:05:38 GMT+0000 (Coordinated Universal Time)
7 Mon Jun 30 2025 10:05:38 GMT+0000 (Coordinated Universal Time)
7 Mon Jun 30 2025 10:05:38 GMT+0000 (Coordinated Universal Time)
8 Mon Jun 30 2025 10:05:38 GMT+0000 (Coordinated Universal Time)
8 Mon Jun 30 2025 10:05:38 GMT+0000 (Coordinated Universal Time)
9 Mon Jun 30 2025 10:05:38 GMT+0000 (Coordinated Universal Time)
9 Mon Jun 30 2025 10:05:38 GMT+0000 (Coordinated Universal Time)
10 Mon Jun 30 2025 10:05:38 GMT+0000 (Coordinated Universal Time)
10 Mon Jun 30 2025 10:05:38 GMT+0000 (Coordinated Universal Time)
Here is the strace output during this:
[pid 8786] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 512, 61184) = 512 <0.000086>
[pid 8786] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002> <unfinished ...>
[pid 8736] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 384, 55808) = 384 <0.000097>
[pid 8736] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000656>
[pid 8786] 10:05:38 <... fdatasync resumed>) = 0 <0.002739>
[pid 8889] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 384, 54528) = 384 <0.000129>
[pid 8889] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000672>
[pid 8786] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 512, 61696) = 512 <0.000094>
[pid 8786] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.001070>
[pid 8736] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 384, 56192) = 384 <0.000118>
[pid 8736] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000927>
[pid 8889] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 384, 54912) = 384 <0.000112>
[pid 8889] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000687>
[pid 8786] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 512, 62208) = 512 <0.000066>
[pid 8786] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000717>
[pid 8736] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 384, 56576) = 384 <0.000095>
[pid 8736] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000745>
[pid 8889] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 384, 55296) = 384 <0.000063>
[pid 8889] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000782>
[pid 8786] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 512, 62720) = 512 <0.000084>
[pid 8786] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000712>
[pid 8736] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 384, 56960) = 384 <0.000080>
[pid 8736] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000814>
[pid 8889] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 384, 55680) = 384 <0.000365>
[pid 8889] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000747>
[pid 8786] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 512, 63232) = 512 <0.000096>
[pid 8786] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000724>
[pid 8736] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 384, 57344) = 384 <0.000108>
[pid 8736] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.001432>
[pid 8889] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 384, 56064) = 384 <0.000118>
[pid 8889] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000737>
[pid 8786] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 512, 63744) = 512 <0.000061>
[pid 8786] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000636>
[pid 8736] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 384, 57728) = 384 <0.000070>
[pid 8736] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000944>
[pid 8889] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 384, 56448) = 384 <0.000105>
[pid 8889] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000712>
[pid 8786] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 512, 64256) = 512 <0.000092>
[pid 8786] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000742>
[pid 8736] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 384, 58112) = 384 <0.000067>
[pid 8736] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000704>
[pid 8889] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 384, 56832) = 384 <0.000152>
[pid 8889] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000732>
[pid 8786] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 512, 64768) = 512 <0.000061>
[pid 8786] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000672>
[pid 8736] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 384, 58496) = 384 <0.000062>
[pid 8736] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000653>
[pid 8889] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 384, 57216) = 384 <0.000102>
[pid 8889] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.001502>
[pid 8786] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 512, 65280) = 512 <0.000072>
[pid 8786] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002> <unfinished ...>
[pid 8736] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 384, 58880) = 384 <0.000123>
[pid 8736] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002> <unfinished ...>
[pid 8786] 10:05:38 <... fdatasync resumed>) = 0 <0.001538>
[pid 8736] 10:05:38 <... fdatasync resumed>) = 0 <0.000625>
[pid 8889] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 384, 57600) = 384 <0.000084>
[pid 8889] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000847>
[pid 8786] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 512, 65792) = 512 <0.000060>
[pid 8786] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000661>
[pid 8736] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 384, 59264) = 384 <0.000074>
[pid 8736] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000779>
[pid 8889] 10:05:38 pwrite64(13</data/db/journal/WiredTigerLog.0000000002>, ""..., 384, 57984) = 384 <0.000077>
[pid 8889] 10:05:38 fdatasync(13</data/db/journal/WiredTigerLog.0000000002>) = 0 <0.000816>
I can see writes and syncs from three processes. Let's check which process belongs to which container:
for pid in 8736 8786 8889; do
  cid=$(grep -ao 'docker[-/][0-9a-f]\{64\}' /proc/$pid/cgroup | head -1 | grep -o '[0-9a-f]\{64\}')
  svc=$(docker inspect --format '{{ index .Config.Labels "com.docker.compose.service"}}' "$cid" 2>/dev/null)
  echo "PID: $pid -> Container ID: $cid -> Compose Service: ${svc:-<not-found>}"
done
PID: 8736 -> Container ID: 93e3ebd715867f1cd885d4c6191064ba0eb93b02c0884a549eec66026c459ac2 -> Compose Service: mongo-3
PID: 8786 -> Container ID: cf52ad45d25801ef1f66a7905fa0fb4e83f23376e4478b99dbdad03456cead9e -> Compose Service: mongo-1
PID: 8889 -> Container ID: c28f835a1e7dc121f9a91c25af1adfb1d823b667c8cca237a33697b4683ca883 -> Compose Service: mongo-2
This confirms that by default, the WAL is synced to disk at commit on each replica and not only on the primary.
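How much of this the client waits for is controlled by the application through the write concern. As a minimal sketch (writeConcern, w, j and wtimeout are standard MongoDB options; the values below are only an illustration on the collection used in this post), an individual update can explicitly ask for a majority acknowledgment with the journal synced, or relax it:
db.mycollection.updateOne(
  { _id: 1 },
  { $inc: { num: 1 } },
  { writeConcern: { w: "majority", j: true, wtimeout: 5000 } }  // wait for a majority, with the journal synced
);
db.mycollection.updateOne(
  { _id: 1 },
  { $inc: { num: 1 } },
  { writeConcern: { w: 1, j: false } }  // relaxed: acknowledge before the journal sync
);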
Simulate one node failure
[pid 8786] is mongo-1 and it is my primary:
rs0 [direct: primary] test> rs.status().members.find(r=>r.state===1).name
...
mongo-1:27017
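Before simulating failures, it can be useful to check that the secondaries are keeping up. A small sketch with the standard mongosh helper, run on the primary:
rs.printSecondaryReplicationInfo();  // prints each secondary and how far it is behind the primary's oplog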
I stop one replica:
docker compose pause mongo-3
[+] Pausing 1/0
✔ Container pgbench-mongo-mongo-3-1 Paused
I run my updates again; they are not impacted by one replica being down:
rs0 [direct: primary] test> rs.status().members.find(r=>r.state===1).name
mongo-1:27017
rs0 [direct: primary] test> for (let i = 1; i <= 10; i++) {
... print(` ${i} ${new Date()}`)
... db.mycollection.updateOne( { _id: 1 }, { $inc: { num: 1 } });
... print(` ${i} ${new Date()}`)
... }
...
1 Mon Jun 30 2025 10:12:28 GMT+0000 (Coordinated Universal Time)
1 Mon Jun 30 2025 10:12:28 GMT+0000 (Coordinated Universal Time)
2 Mon Jun 30 2025 10:12:28 GMT+0000 (Coordinated Universal Time)
2 Mon Jun 30 2025 10:12:28 GMT+0000 (Coordinated Universal Time)
3 Mon Jun 30 2025 10:12:28 GMT+0000 (Coordinated Universal Time)
3 Mon Jun 30 2025 10:12:28 GMT+0000 (Coordinated Universal Time)
4 Mon Jun 30 2025 10:12:28 GMT+0000 (Coordinated Universal Time)
4 Mon Jun 30 2025 10:12:28 GMT+0000 (Coordinated Universal Time)
5 Mon Jun 30 2025 10:12:28 GMT+0000 (Coordinated Universal Time)
5 Mon Jun 30 2025 10:12:28 GMT+0000 (Coordinated Universal Time)
6 Mon Jun 30 2025 10:12:28 GMT+0000 (Coordinated Universal Time)
6 Mon Jun 30 2025 10:12:28 GMT+0000 (Coordinated Universal Time)
7 Mon Jun 30 2025 10:12:28 GMT+0000 (Coordinated Universal Time)
7 Mon Jun 30 2025 10:12:28 GMT+0000 (Coordinated Universal Time)
8 Mon Jun 30 2025 10:12:28 GMT+0000 (Coordinated Universal Time)
8 Mon Jun 30 2025 10:12:28 GMT+0000 (Coordinated Universal Time)
9 Mon Jun 30 2025 10:12:28 GMT+0000 (Coordinated Universal Time)
9 Mon Jun 30 2025 10:12:28 GMT+0000 (Coordinated Universal Time)
10 Mon Jun 30 2025 10:12:28 GMT+0000 (Coordinated Universal Time)
10 Mon Jun 30 2025 10:12:28 GMT+0000 (Coordinated Universal Time)
Simulate a two-node failure
I stopped another replica:
docker compose pause mongo-2
[+] Pausing 1/0
✔ Container demo-mongo-2-1 Paused
As there's no quorum anymore, with only one member remaining in a replica set of three, the primary stepped down and can no longer accept writes:
rs0 [direct: primary] test> for (let i = 1; i <= 10; i++) {
... print(` ${i} ${new Date()}`)
... db.mycollection.updateOne( { _id: 1 }, { $inc: { num: 1 } });
... print(` ${i} ${new Date()}`)
... }
1 Mon Jun 30 2025 09:28:36 GMT+0000 (Coordinated Universal Time)
MongoServerError[NotWritablePrimary]: not primary
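To confirm that no primary remains, the member states can be listed from the surviving node. This is a sketch based on the same rs.status() output used earlier:
rs.status().members.map(m => ({ host: m.name, state: m.stateStr }));
// the surviving member reports SECONDARY; the paused members are reported as unreachable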
Reads from secondary
The node that remains is now a secondary and exposes the last writes acknowledged by the majority:
rs0 [direct: secondary] test> db.mycollection.find()
[ { _id: 1, num: 20 } ]
rs0 [direct: secondary] test> db.mycollection.find().readConcern("majority")
[ { _id: 1, num: 20 } ]
If the other nodes restart but are isolated from this secondary, the secondary still shows the same timeline: consistent but stale reads.
I simulate that by disconnecting this node and restarting the others:
docker network disconnect demo_mongoha demo-mongo-1-1
docker unpause demo-mongo-2-1
docker unpause demo-mongo-3-1
As the two others form a quorum, one of them is elected primary and accepts writes:
-bash-4.2# docker compose exec -it mongo-2 mongosh
Current Mongosh Log ID: 686264bd3e0326801369e327
Connecting to: mongodb://127.0.0.1:27017/?directConnection=true&serverSelectionTimeoutMS=2000&appName=mongosh+2.5.2
Using MongoDB: 8.0.10
Using Mongosh: 2.5.2
rs0 [direct: primary] test> for (let i = 1; i <= 10; i++) {
... print(` ${i} ${new Date()}`)
... db.mycollection.updateOne( { _id: 1 }, { $inc: { num: 1 } });
... print(` ${i} ${new Date()}`)
... }
1 Mon Jun 30 2025 10:20:09 GMT+0000 (Coordinated Universal Time)
1 Mon Jun 30 2025 10:20:09 GMT+0000 (Coordinated Universal Time)
2 Mon Jun 30 2025 10:20:09 GMT+0000 (Coordinated Universal Time)
2 Mon Jun 30 2025 10:20:09 GMT+0000 (Coordinated Universal Time)
3 Mon Jun 30 2025 10:20:09 GMT+0000 (Coordinated Universal Time)
3 Mon Jun 30 2025 10:20:09 GMT+0000 (Coordinated Universal Time)
4 Mon Jun 30 2025
The PG_TDE Extension Is Now Ready for Production
Lately, it feels like every time I go to a technical conference, someone is talking about how great PostgreSQL is. I’d think it’s just me noticing, but the rankings and surveys say otherwise. PostgreSQL is simply very popular. From old-school bare metal setups to VMs, containers, and fully managed cloud databases, PostgreSQL keeps gaining ground. And […]
We Made Postgres Writes Faster, but it Broke Replication
How we made an LSM tree data structure safe for Postgres physical replication
We Made Postgres Writes Faster, but it Broke Replication
Exploring LSM tree implementation in PostgreSQL for better write performance.
June 28, 2025
Flush to disk on commit 👉🏻 MongoDB durable writes
A Twitter (𝕏) thread was filled with misconceptions about MongoDB, spreading fear, uncertainty, and doubt (FUD). This led one user to question whether MongoDB acknowledges writes before they are actually flushed to disk:
Doesn't MongoDB acknowledge writes before it's actually flushed them to disk?
MongoDB, like many databases, employs journaling—also known as write-ahead logging (WAL)—to ensure durability (the D in ACID) with high performance. This involves safely recording write operations in the journal and ensuring they are flushed to disk before the commit is acknowledged. Further details can be found in the documentation under Write Concern and Journaling.
Here is how you can test it, in a lab, with Linux strace and GDB, to debunk the myths.
Start the lab
I created a local MongoDB server. I use a single-node local Atlas cluster here, but you can do the same with replicas:
atlas deployments setup atlas --type local --port 27017 --force
Start it if it was stopped, and connect with MongoDB Shell:
atlas deployments start atlas
mongosh
Trace the system calls with strace
In another terminal, I used strace to display the system calls (-e trace) to write (pwrite64) and sync (fdatasync) the files, with the file names (-yy), by the MongoDB server process (-p $(pgrep -d, mongod)) and its threads (-f), with the execution time and timestamp (-tT):
strace -tT -fp $(pgrep -d, mongod) -yye trace=pwrite64,fdatasync -qqs 0
Some writes and syncs happen in the background
[pid 2625869] 08:26:13 fdatasync(11</data/db/WiredTiger.wt>) = 0 <0.000022>
[pid 2625869] 08:26:13 pwrite64(13</data/db/journal/WiredTigerLog.0000000010>, ""..., 384, 19072) = 384 <0.000024>
[pid 2625869] 08:26:13 fdatasync(13</data/db/journal/WiredTigerLog.0000000010>) = 0 <0.002123>
[pid 2625868] 08:26:13 pwrite64(13</data/db/journal/WiredTigerLog.0000000010>, ""..., 128, 19456) = 128 <0.000057>
[pid 2625868] 08:26:13 fdatasync(13</data/db/journal/WiredTigerLog.0000000010>) = 0 <0.002192>
[pid 2625868] 08:26:23 pwrite64(13</data/db/journal/WiredTigerLog.0000000010>, ""..., 384, 19584) = 384 <0.000057>
[pid 2625868] 08:26:23 fdatasync(13</data/db/journal/WiredTigerLog.0000000010>) = 0 <0.002068>
[pid 2625868] 08:26:33 pwrite64(13</data/db/journal/WiredTigerLog.0000000010>, ""..., 384, 19968) = 384 <0.000061>
[pid 2625868] 08:26:33 fdatasync(13</data/db/journal/WiredTigerLog.0000000010>) = 0 <0.002747>
[pid 2625868] 08:26:43 pwrite64(13</data/db/journal/WiredTigerLog.0000000010>, ""..., 384, 20352) = 384 <0.000065>
[pid 2625868] 08:26:43 fdatasync(13</data/db/journal/WiredTigerLog.0000000010>) = 0 <0.003008>
[pid 2625868] 08:26:53 pwrite64(13</data/db/journal/WiredTigerLog.0000000010>, ""..., 384, 20736) = 384 <0.000075>
[pid 2625868] 08:26:53 fdatasync(13</data/db/journal/WiredTigerLog.0000000010>) = 0 <0.002092>
[pid 2625868] 08:27:03 pwrite64(13</data/db/journal/WiredTigerLog.0000000010>, ""..., 384, 21120) = 384 <0.000061>
[pid 2625868] 08:27:03 fdatasync(13</data/db/journal/WiredTigerLog.0000000010>) = 0 <0.002527>
[pid 2625869] 08:27:13 fdatasync(13</data/db/journal/WiredTigerLog.0000000010>) = 0 <0.000033>
Write to the collection
In the MongoDB shell, I created a collection and ran ten updates:
db.mycollection.drop();
db.mycollection.insert( { _id: 1, num:0 });
for (let i = 1; i <= 10; i++) {
  print(` ${i} ${new Date()}`)
  db.mycollection.updateOne( { _id: 1 }, { $inc: { num: 1 } });
  print(` ${i} ${new Date()}`)
}
strace output the following while running the loop of ten updates:
[pid 2625868] 08:33:07 pwrite64(13</data/db/journal/WiredTigerLog.0000000010>, ""..., 512, 76288) = 512 <0.000066>
[pid 2625868] 08:33:07 fdatasync(13</data/db/journal/WiredTigerLog.0000000010>) = 0 <0.001865>
[pid 2625868] 08:33:07 pwrite64(13</data/db/journal/WiredTigerLog.0000000010>, ""..., 512, 76800) = 512 <0.000072>
[pid 2625868] 08:33:07 fdatasync(13</data/db/journal/WiredTigerLog.0000000010>) = 0 <0.001812>
[pid 2625868] 08:33:07 pwrite64(13</data/db/journal/WiredTigerLog.0000000010>, ""..., 512, 77312) = 512 <0.000056>
[pid 2625868] 08:33:07 fdatasync(13</data/db/journal/WiredTigerLog.0000000010>) = 0 <0.001641>
[pid 2625868] 08:33:07 pwrite64(13</data/db/journal/WiredTigerLog.0000000010>, ""..., 512, 77824) = 512 <0.000043>
[pid 2625868] 08:33:07 fdatasync(13</data/db/journal/WiredTigerLog.0000000010>) = 0 <0.001812>
[pid 2625868] 08:33:07 pwrite64(13</data/db/journal/WiredTigerLog.0000000010>, ""..., 512, 78336) = 512 <0.000175>
[pid 2625868] 08:33:07 fdatasync(13</data/db/journal/WiredTigerLog.0000000010>) = 0 <0.001944>
[pid 2625868] 08:33:07 pwrite64(13</data/db/journal/WiredTigerLog.0000000010>, ""..., 512, 78848) = 512 <0.000043>
[pid 2625868] 08:33:07 fdatasync(13</data/db/journal/WiredTigerLog.0000000010>) = 0 <0.001829>
[pid 2625868] 08:33:07 pwrite64(13</data/db/journal/WiredTigerLog.0000000010>, ""..., 512, 79360) = 512 <0.000043>
[pid 2625868] 08:33:07 fdatasync(13</data/db/journal/WiredTigerLog.0000000010>) = 0 <0.001917>
[pid 2625868] 08:33:07 pwrite64(13</data/db/journal/WiredTigerLog.0000000010>, ""..., 512, 79872) = 512 <0.000050>
[pid 2625868] 08:33:07 fdatasync(13</data/db/journal/WiredTigerLog.0000000010>) = 0 <0.002260>
[pid 2625868] 08:33:07 pwrite64(13</data/db/journal/WiredTigerLog.0000000010>, ""..., 512, 80384) = 512 <0.000035>
[pid 2625868] 08:33:07 fdatasync(13</data/db/journal/WiredTigerLog.0000000010>) = 0 <0.001940>
[pid 2625868] 08:33:07 pwrite64(13</data/db/journal/WiredTigerLog.0000000010>, ""..., 512, 80896) = 512 <0.000054>
[pid 2625868] 08:33:07 fdatasync(13</data/db/journal/WiredTigerLog.0000000010>) = 0 <0.001984>
Each write (pwrite64) to the journal files was followed by a sync to disk (fdatasync). This system call is well documented:
FSYNC(2) Linux Programmer's Manual FSYNC(2)
NAME
fsync, fdatasync - synchronize a file's in-core state with storage device
DESCRIPTION
fsync() transfers ("flushes") all modified in-core data of (i.e., modified buffer cache pages for) the file referred to by the file descriptor fd to
the disk device (or other permanent storage device) so that all changed information can be retrieved even if the system crashes or is rebooted.
This includes writing through or flushing a disk cache if present. The call blocks until the device reports that the transfer has completed.
...
fdatasync() is similar to fsync(), but does not flush modified metadata unless that metadata is needed in order to allow a subsequent data retrieval to be correctly handled. For example, changes to st_atime or st_mtime (respectively, time of last access and time of last modification
...
The aim of fdatasync() is to reduce disk activity for applications that do not require all metadata to be synchronized with the disk.
Since I display both the committed time and the system call trace times, you can see that they match. The output related to the traces above demonstrates this alignment:
1 Sat Jun 28 2025 08:33:07 GMT+0000 (Greenwich Mean Time)
2 Sat Jun 28 2025 08:33:07 GMT+0000 (Greenwich Mean Time)
3 Sat Jun 28 2025 08:33:07 GMT+0000 (Greenwich Mean Time)
4 Sat Jun 28 2025 08:33:07 GMT+0000 (Greenwich Mean Time)
5 Sat Jun 28 2025 08:33:07 GMT+0000 (Greenwich Mean Time)
6 Sat Jun 28 2025 08:33:07 GMT+0000 (Greenwich Mean Time)
7 Sat Jun 28 2025 08:33:07 GMT+0000 (Greenwich Mean Time)
8 Sat Jun 28 2025 08:33:07 GMT+0000 (Greenwich Mean Time)
9 Sat Jun 28 2025 08:33:07 GMT+0000 (Greenwich Mean Time)
10 Sat Jun 28 2025 08:33:07 GMT+0000 (Greenwich Mean Time)
Multi-document transactions
The previous example ran ten autocommit updates, each calling a synchronisation to disk.
In general, with good document data modeling, a document should match the business transaction. However, it is possible to use multi-document transactions, and they are ACID (atomic, consistent, isolated, and durable). Using multi-document transactions also reduces the sync latency, as the sync is required only once per transaction, at commit.
I've run the following with five transactions, each running one update and one insert:
const session = db.getMongo().startSession();
for (let i = 1; i <= 5; i++) {
  session.startTransaction();
  const sessionDb = session.getDatabase(db.getName());
  sessionDb.mycollection.updateOne( { _id: 1 }, { $inc: { num: 1 } });
  print(` ${i} updated ${new Date()}`)
  sessionDb.mycollection.insertOne( { answer:42 });
  print(` ${i} inserted ${new Date()}`)
  session.commitTransaction();
  print(` ${i} committed ${new Date()}`)
}
Strace still shows ten calls to pwrite64 and fdatasync. I used this multi-document transaction to go further and prove that the commit not only triggers a sync to disk, but also waits for its acknowledgment before returning a successful response to the application.
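The write concern applies to the whole transaction and can be passed when starting it, so the single sync at commit carries the durability guarantee for every statement inside. Here is a sketch reusing the session from above (the startTransaction options are standard; the values are only illustrative):
session.startTransaction({
  readConcern: { level: "snapshot" },
  writeConcern: { w: "majority", j: true }
});
const txnDb = session.getDatabase(db.getName());
txnDb.mycollection.updateOne( { _id: 1 }, { $inc: { num: 1 } });
txnDb.mycollection.insertOne( { answer: 42 });
session.commitTransaction();  // one journal sync, acknowledged with the requested write concern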
Inject some latency with gdb
To show that the commit waits for the acknowledgment of fdatasync, I used a GDB breakpoint on the fdatasync call.
I stopped strace, and started GDB with a script that adds a latency of five seconds to fdatasync:
cat > gdb_slow_fdatasync.gdb <<GDB
break fdatasync
commands
shell sleep 5
continue
end
continue
GDB
gdb --batch -x gdb_slow_fdatasync.gdb -p $(pgrep mongod)
I ran the loop of five transactions again, each with two writes. GDB shows when it hits the breakpoint:
Thread 31 "JournalFlusher" hit Breakpoint 1, 0x0000ffffa6096eec in fdatasync () from target:/lib64/libc.so.6
My GDB script automatically waits five seconds and continues the program, until the next call to fdatasync.
Here was the output from my loop with five transactions:
1 updated Sat Jun 28 2025 08:49:32 GMT+0000 (Greenwich Mean Time)
1 inserted Sat Jun 28 2025 08:49:32 GMT+0000 (Greenwich Mean Time)
1 committed Sat Jun 28 2025 08:49:37 GMT+0000 (Greenwich Mean Time)
2 updated Sat Jun 28 2025 08:49:37 GMT+0000 (Greenwich Mean Time)
2 inserted Sat Jun 28 2025 08:49:37 GMT+0000 (Greenwich Mean Time)
2 committed Sat Jun 28 2025 08:49:42 GMT+0000 (Greenwich Mean Time)
3 updated Sat Jun 28 2025 08:49:42 GMT+0000 (Greenwich Mean Time)
3 inserted Sat Jun 28 2025 08:49:42 GMT+0000 (Greenwich Mean Time)
3 committed Sat Jun 28 2025 08:49:47 GMT+0000 (Greenwich Mean Time)
4 updated Sat Jun 28 2025 08:49:47 GMT+0000 (Greenwich Mean Time)
4 inserted Sat Jun 28 2025 08:49:47 GMT+0000 (Greenwich Mean Time)
4 committed Sat Jun 28 2025 08:49:52 GMT+0000 (Greenwich Mean Time)
5 updated Sat Jun 28 2025 08:49:52 GMT+0000 (Greenwich Mean Time)
5 inserted Sat Jun 28 2025 08:49:52 GMT+0000 (Greenwich Mean Time)
The insert and update operations occur immediately, but the commit itself waits five seconds, because of the latency I injected with GDB. This demonstrates that the commit waits for fdatasync, guaranteeing the flush to persistent storage. For this demo, I used all default settings in MongoDB 8.0, but this behavior can still be tuned through write concern and journaling configurations.
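For example, here is a sketch of how those defaults can be inspected and how the journal requirement can be set per write (getDefaultRWConcern is a standard admin command; the writeConcern values are only illustrative):
// inspect the cluster-wide default read and write concerns
db.adminCommand({ getDefaultRWConcern: 1 });
// explicitly wait for the journal sync, which is the behavior demonstrated above
db.mycollection.updateOne( { _id: 1 }, { $inc: { num: 1 } }, { writeConcern: { w: 1, j: true } } );
// trade durability for latency: acknowledge before the journal is synced
db.mycollection.updateOne( { _id: 1 }, { $inc: { num: 1 } }, { writeConcern: { w: 1, j: false } } );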
I used GDB to examine the call stack. Alternatively, you can inject a delay with strace by adding this option: -e inject=fdatasync:delay_enter=5000000.
Look at the open source code
When calling fdatasync, errors can occur, and this may compromise durability if operations on the file descriptor continue (remember the PostgreSQL fsyncgate). MongoDB uses the open-source WiredTiger storage engine, which implemented the same solution as PostgreSQL to avoid that: panic instead of retry. You can review the os_fs.c code to verify this.
The fdatasync call is in the JournalFlusher thread and here is the backtrace:
#0 0x0000ffffa0b5ceec in fdatasync () from target:/lib64/libc.so.6
#1 0x0000aaaadf5312c0 in __posix_file_sync ()
#2 0x0000aaaadf4f53c8 in __log_fsync_file ()
#3 0x0000aaaadf4f58d4 in __wt_log_force_sync ()
#4 0x0000aaaadf4fb8b8 in __wt_log_flush ()
#5 0x0000aaaadf588348 in __session_log_flush ()
#6 0x0000aaaadf41b878 in mongo::WiredTigerSessionCache::waitUntilDurable(mongo::OperationContext*, mongo::WiredTigerSessionCache::Fsync, mongo::WiredTigerSessionCache::UseJournalListener) ()
#7 0x0000aaaadf412358 in mongo::WiredTigerRecoveryUnit::waitUntilDurable(mongo::OperationContext*) ()
#8 0x0000aaaadfbe855c in mongo::JournalFlusher::run() ()
Here are some entrypoints if you want to look at the code behind this:
- JournalFlusher, which calls the posix system call in WiredTiger's os_fs.c: journal_flusher.cpp
- waitForWriteConcern, calling the JournalFlusher from the connection's thread: write_concern.cpp
Base your opinions on facts, not myths.
MongoDB began as a NoSQL database that prioritized availability and low latency over strong consistency. However, that was over ten years ago. As technology evolves, experts who refuse to constantly learn risk their knowledge becoming outdated, their skills diminishing, and their credibility suffering.
Today, MongoDB is a general-purpose database that supports transaction atomicity, consistency, isolation, and durability—whether the transaction involves a single document, or multiple documents.
Next time you encounter claims from uninformed critics or detractors suggesting that MongoDB is not consistent or fails to flush committed changes to disk, you can confidently debunk these myths by referring to official documentation, the open source code, and conducting your own experiments. MongoDB is similar to PostgreSQL: buffered writes and WAL sync to disk on commit.
Want to meet people, try charging them for it?
I have been blogging consistently since 2017. And one of my goals in
speaking publicly was always to connect with like-minded people. I
always left my email and hoped people would get in touch. Even while
my blog and twitter became popular, passing 1M views and 20k
followers, I basically never had people get in touch to chat or meet up.
So it felt kind of ridiculous when last November I started charging
people $100 to chat. I mean, who am
I? But people started showing up fairly immediately. Now granted the
money did not go to me. It went to an education non-profit and I
merely received the receipt.
And at this point I've met a number of interesting people, from VCs to
business professors to undergraduate students to founders and everyone
in between. People wanting to talk about trends in databases, about
how to succeed as a programmer, about marketing for developers, and so
on. Women and men throughout North America, Europe, Africa, New
Zealand, India, Nepal, and so on. And I've raised nearly $6000 for
educational non-profits.
How is it that you go from giving away your time for free and getting
no hits to charging and almost immediately getting results? For one,
every person responded very positively to it being a fundraiser. It
also helps me be entirely shameless about sharing on social media
every single time someone donates; because it's such a positive thing.
But also I think that in "charging" for my time it helps people feel
more comfortable about actually taking my time, especially when we
have never met. It gives you a reasonable excuse to take time from
an internet rando.
On the other hand, a lot of people come for advice and I think giving
advice is pretty dangerous, especially since my background is not
super conventional. I try to always frame things as just sharing my
opinion and my perspective and that they should talk with many others
and not take my suggestions without consideration.
And there's also the problem that by charging everyone for my time
now, I'm no longer available to people who could maybe use it the
most. I do mention on my page that I will still take calls from people
who don't donate, as my schedule allows. But to be honest I feel less
incentivized to spend time when people do not donate. So I guess this
is an issue with the program.
But I mitigated even this slightly, and significantly jump-started the
program, during my 30th birthday when
I took calls with any person who donated at least $30.
Anyway, I picked this path because I have wanted to get involved with
helping students figure out their lives and careers. But without a
degree I am literally unqualified for many volunteering programs. And
I always found the time commitments for non-profits painful.
So until starting this I figured it wouldn't be until I retire that I
find some way to make a difference. But ultimately I kept meeting
people who were starting their own non-profits now or donated
significantly to help students. Peer pressure. I wanted to do my part
now. And 30 minutes of my time in return for a donation receipt has
been an easy trade.
While only raising a humble $6,000 to date, the Chat for
Education program has been more
successful than I imagined. I've met many amazing people through
it. And it's something that should be easy to keep up indefinitely.
I hope to meet you through it too!