a curated list of database news from authoritative sources

May 29, 2026

$exists and non-sparse indexes in MongoDB and in other DocumentDB

In SQL databases, NULL represents an unknown value — not the absence of a value. When a value is simply non-applicable for a given entity, the correct relational modeling approach is normalization: the entity gets no row in the relevant table at all, rather than a NULL in a column. This distinction becomes tricky with OUTER JOIN results, where the absence of a row is surfaced as NULL across all columns of the unmatched side, including key columns — making it easy to confuse "unknown value" with "no row existed."

MongoDB has its own subtlety: a field can be explicitly set to null or simply not exist in the document at all. In the BSON representation, these are distinct: one is a key with a null-typed value, the other is the absence of the key entirely. The schema is flexible: you can define a field or not. But in indexes, this distinction disappears. Except for sparse and partial indexes, an index must have a key value for every document in the collection. For documents where the field is missing, MongoDB uses null as a stand-in, the same key value used for explicit nulls. This means an index scan cannot distinguish between the two states, and resolving null vs. missing requires fetching the full document to apply a residual filter.

Consequently, a standard index scan with a filter on null or $exists is inexact: the query planner performs an index scan on the null key and then fetches the full document to verify whether the field is truly null or simply absent.

An example: { $exists: true } filter

When you query with { num: { $exists: true } }, you expect MongoDB to use an index on num. Let's test it on MongoDB and on some emulations: Oracle Database, Amazon DocumentDB (AWS), and the DocumentDB extension for PostgreSQL (Microsoft).

Here is my test collection:

db.test.insertMany([
  { _id: 1, num: 42   },
  { _id: 2, num: 7    },
  { _id: 3, num: null },
  { _id: 4            },
  { _id: 5, num: 99   },
  { _id: 6            },
  { _id: 7, num: null },
  { _id: 8, num: 15   }
])

I have inserted eight documents:

  • four with real values (_id 1, 2, 5, 8),
  • two with the field explicitly set to null (_id 3, 7), and
  • two where the field is entirely absent (_id 4, 6).

The query { num: { $exists: true } } should return six documents — everything except _id 4 and 6.

Before touching indexes, notice that $exists is not the same as a null check:

db.test.find({ num: null })

[
  { _id: 3, num: null },
  { _id: 4 },
  { _id: 6 },
  { _id: 7, num: null }
]

db.test.find({ num: { $exists: false } })

[
  { _id: 4 },
  { _id: 6 }
]

db.test.find({ num: { $exists: true } })

[
  { _id: 1, num: 42 },
  { _id: 2, num: 7 },
  { _id: 3, num: null },
  { _id: 5, num: 99 },
  { _id: 7, num: null },
  { _id: 8, num: 15 }
]

A field set to null exists. A field not written into the document does not. This distinction is perfectly clear at the document level. At the index level, it is not.
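The three results above can be modeled in plain JavaScript, without a MongoDB connection. This is only a sketch of the filter semantics: the helper names (fieldExists, nullOrMissing, idsWhere) are invented for the illustration and are not part of any driver API.

```javascript
// Plain-JavaScript model of the three filters (no MongoDB connection needed).
const docs = [
  { _id: 1, num: 42 }, { _id: 2, num: 7 }, { _id: 3, num: null }, { _id: 4 },
  { _id: 5, num: 99 }, { _id: 6 }, { _id: 7, num: null }, { _id: 8, num: 15 },
];

// Hypothetical helpers, named for this sketch only.
const fieldExists = (doc) => "num" in doc;                            // { num: { $exists: true } }
const nullOrMissing = (doc) => !fieldExists(doc) || doc.num === null; // { num: null }

const idsWhere = (predicate) => docs.filter(predicate).map((d) => d._id);

const nullMatches = idsWhere(nullOrMissing);           // explicit null and missing
const missingOnly = idsWhere((d) => !fieldExists(d));  // { num: { $exists: false } }
const existing = idsWhere(fieldExists);                // explicit null and real values

console.log({ nullMatches, missingOnly, existing });
```

The null filter is the union of two states that $exists keeps apart, which is exactly the information an index key alone cannot carry.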

Null in the index key is ambiguous

When MongoDB builds a B-tree index on num, it must create an entry for every document. For documents with no num field, it stores a null key. For documents where num is explicitly set to null, it also stores null. Both cases produce the same index key.

Here is how the non-sparse index looks:

Non-sparse index on { num: 1 }:

  null  →  _id:3  { num: null }    explicit null
  null  →  _id:4  { }              missing field
  null  →  _id:6  { }              missing field
  null  →  _id:7  { num: null }    explicit null
  7     →  _id:2
  15    →  _id:8
  42    →  _id:1
  99    →  _id:5

The four entries under the null key are indistinguishable from the index alone. To evaluate $exists, the engine must read the actual document. This is called a residual predicate — a filter condition the index cannot resolve, deferred to a later fetch stage.
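As a rough model of the listing above, the following sketch derives the index entries from the documents. It assumes the simplified rule described in the text (a missing field is stored under the null key); keyFor, rank and asNumber are names made up for the sketch, not MongoDB internals.

```javascript
const docs = [
  { _id: 1, num: 42 }, { _id: 2, num: 7 }, { _id: 3, num: null }, { _id: 4 },
  { _id: 5, num: 99 }, { _id: 6 }, { _id: 7, num: null }, { _id: 8, num: 15 },
];

// Assumed model of a non-sparse index: a missing field is stored under the null key.
const keyFor = (doc) => ("num" in doc ? doc.num : null);

const rank = (key) => (key === null ? 0 : 1);      // null sorts before numbers
const asNumber = (key) => (key === null ? 0 : key);

const entries = docs
  .map((d) => ({ key: keyFor(d), id: d._id }))
  // order by key, then by _id so the output is stable
  .sort((a, b) =>
    rank(a.key) - rank(b.key) || asNumber(a.key) - asNumber(b.key) || a.id - b.id);

const scanOrder = entries.map((e) => e.id);
const nullBucket = entries.filter((e) => e.key === null).map((e) => e.id);

// Telling explicit nulls from missing fields requires going back to the documents.
const explicitNulls = nullBucket.filter((id) => "num" in docs.find((d) => d._id === id));

console.log({ scanOrder, nullBucket, explicitNulls });
```

The last step is the point: nullBucket holds four ids with identical keys, and only a lookup in docs (the equivalent of a fetch) separates the two explicit nulls from the two missing fields.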

Another way to look at it: the document schema is flexible, with no structure declared upfront and fields that may or may not exist. Indexes are different: their schema is declared, and the key fields always exist.

MongoDB with a non-sparse index

I create a regular index, which is by default non-sparse and has one index entry per document (or more for multi-key indexes).

db.test.createIndex({ num: 1 })

db.test.find({ num: { $exists: true } }).explain("executionStats")

The execution plan shows what happens across the IXSCAN and FETCH stages:

executionStats: {
  nReturned: 6,
  totalKeysExamined: 8,
  totalDocsExamined: 8,
  executionStages: {
    stage: 'FETCH',
    filter: { num: { '$exists': true } },
    nReturned: 6,
    docsExamined: 8,
    inputStage: {
      stage: 'IXSCAN',
      nReturned: 8,
      isSparse: false,
      indexBounds: { num: [ '[MinKey, MaxKey]' ] },
      keysExamined: 8
    }
  }
}

The IXSCAN returns all 8 index entries across the full [MinKey, MaxKey] range. The FETCH stage then reads all 8 documents and applies filter: { num: { $exists: true } } as a residual predicate, discarding _id 4 and 6. Notice docsExamined: 8 but nReturned: 6 — two fetches were wasted. The index was used, but the null bucket forced unnecessary work.

MongoDB with a sparse index

A sparse index excludes documents where the indexed field is entirely absent. It does not exclude explicit null values. Documents _id 3 and 7 have num: null and are still indexed.

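db.test.dropIndex({ num: 1 })   // drop the non-sparse index first: the same key with different options conflicts
db.test.createIndex({ num: 1 }, { sparse: true })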

db.test.find({ num: { $exists: true } }).explain("executionStats")

As I have no projection, there is still a FETCH, but only for the documents in the final result:

executionStats: {
  nReturned: 6,
  totalKeysExamined: 6,
  totalDocsExamined: 6,
  executionStages: {
    stage: 'FETCH',
    nReturned: 6,
    docsExamined: 6,
    inputStage: {
      stage: 'IXSCAN',
      nReturned: 6,
      isSparse: true,
      indexBounds: { num: [ '[MinKey, MaxKey]' ] },
      keysExamined: 6
    }
  }
}

keysExamined dropped from 8 to 6 — the two missing-field documents are not in the index. More importantly, the FETCH stage has no filter. There is no residual predicate. Every document pointed to by the sparse index either has a real value or has an explicit null — both satisfy $exists: true. The index structure itself proves the condition. The FETCH still happens because find() needs to return the documents, but it is doing useful work only, not wasted disambiguation.

Here is how the sparse index looks:

Sparse index on { num: 1 }:

  null  →  _id:3  { num: null }    explicit null — indexed
  null  →  _id:7  { num: null }    explicit null — indexed
  7     →  _id:2
  15    →  _id:8
  42    →  _id:1
  99    →  _id:5

  _id:4  { }  — not indexed
  _id:6  { }  — not indexed

The null bucket still exists in a sparse index, but it contains only explicit nulls. The ambiguity is gone.
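The difference between the two plans can be reproduced with a small simulation. It is a sketch under the simplified model used in this post (one entry per document for a non-sparse index, no entry for missing fields in a sparse one); buildIndex and explainExists are hypothetical helpers, and the counters only mimic the names found in explain output.

```javascript
const docs = [
  { _id: 1, num: 42 }, { _id: 2, num: 7 }, { _id: 3, num: null }, { _id: 4 },
  { _id: 5, num: 99 }, { _id: 6 }, { _id: 7, num: null }, { _id: 8, num: 15 },
];
const exists = (doc) => "num" in doc;

// Sparse: skip documents without the field. Non-sparse: store them under null.
function buildIndex(sparse) {
  return docs
    .filter((d) => !sparse || exists(d))
    .map((d) => ({ key: exists(d) ? d.num : null, id: d._id }));
}

// Simulates { num: { $exists: true } } as a full index scan followed by a fetch.
function explainExists(sparse) {
  const byId = new Map(docs.map((d) => [d._id, d]));
  const entries = buildIndex(sparse);      // IXSCAN over the whole key range
  let docsExamined = 0;
  let nReturned = 0;
  for (const entry of entries) {
    const doc = byId.get(entry.id);        // FETCH
    docsExamined++;
    // The residual filter is only needed when the index may point to missing fields.
    if (sparse || exists(doc)) nReturned++;
  }
  return { keysExamined: entries.length, docsExamined, nReturned, residualFilter: !sparse };
}

const nonSparsePlan = explainExists(false);
const sparsePlan = explainExists(true);
console.log(nonSparsePlan, sparsePlan);
```

With the eight test documents, the non-sparse run examines 8 keys and 8 documents to return 6, while the sparse run examines 6 and 6 with no residual filter, matching the two execution plans shown earlier.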

Oracle Database

I reproduced the same on Oracle Database with the MongoDB emulation:

ora> db.test.createIndex({ num: 1 })
num_1
ora> db.test.find({ num: { $exists: true } }).explain("executionStats")
{
  queryPlanner: {
    namespace: 'ora.test',
    parsedQuery: { num: { '$exists': true } },
    rewrittenQuery: { num: { '$exists': true } },
    generatedSql: `select "DATA",rawtohex("RESID"),"ETAG" from "ORA"."test" where JSON_EXISTS("DATA",'$?(exists(@.num)) ' type(strict))`,
    winningPlan: ' Plan Hash Value  : 3552627291 \n' +
      '\n' +
      '--------------------------------------------------------------------------------------------------\n' +
      '| Id  | Operation                             | Name            | Rows | Bytes | Cost | Time     |\n' +
      '--------------------------------------------------------------------------------------------------\n' +
      '|   0 | SELECT STATEMENT                      |                 |    1 | 24501 |    2 | 00:00:01 |\n' +
      '|   1 |   TABLE ACCESS BY INDEX ROWID BATCHED | test            |    1 | 24501 |    2 | 00:00:01 |\n' +
      '|   2 |    HASH UNIQUE                        |                 |    1 | 24501 |      |          |\n' +
      '| * 3 |     INDEX RANGE SCAN (MULTI VALUE)    | $ora:test.num_1 |    1 |       |    1 | 00:00:01 |\n' +
      '--------------------------------------------------------------------------------------------------\n' +
      '\n' +
      'Predicate Information (identified by operation id):\n' +
      '------------------------------------------\n' +
      `* 3 - access(JSON_QUERY("DATA" /*+ LOB_BY_VALUE */ FORMAT OSON , '$."num"[*]' RETURNING ANY ORA_RAWCOMPARE ASIS WITHOUT ARRAY WRAPPER ERROR ON ERROR PRESENT ON EMPTY NULL ON MISMATCH TYPE(LAX)\n` +
      "  MULTIVALUE)>HEXTORAW('01'))\n" +
      '\n' +
      '\n' +
      'Notes\n' +
      '-----\n' +
      '- Dynamic sampling used for this statement ( level = 2 )\n' +
      '\n'
  },
  serverInfo: { host: 'localhost', port: 27017, version: '7.0.22' },
  ok: 1
}
ora>

It doesn't display the execution statistics, but I can get them from the SQL endpoint:

sql> select /*+ gather_plan_statistics */ "DATA",rawtohex("RESID"),"ETAG" from "ORA"."test" where JSON_EXISTS("DATA",'$?(exists(@.num)) ' type(strict));

DATA                    RAWTOHEX("RESID")    ETAG
_______________________ ____________________ ___________________________________
{"_id":3,"num":null}    03C104               523160F3D2777CB2E0637B5B000A71CD
{"_id":7,"num":null}    03C108               523160F3D27F7CB2E0637B5B000A71CD
{"_id":2,"num":7}       03C103               523160F3D2757CB2E0637B5B000A71CD
{"_id":8,"num":15}      03C109               523160F3D2817CB2E0637B5B000A71CD
{"_id":1,"num":42}      03C102               523160F3D2737CB2E0637B5B000A71CD
{"_id":5,"num":99}      03C106               523160F3D27B7CB2E0637B5B000A71CD

6 rows selected.

sql> select * from dbms_xplan.display_cursor(format=>'allstats last');

PLAN_TABLE_OUTPUT
____________________________________________________________________________________________________________________
SQL_ID  c08vsvqpn75vw, child number 0
-------------------------------------
select /*+ gather_plan_statistics */ "DATA",rawtohex("RESID"),"ETAG"
from "ORA"."test" where JSON_EXISTS("DATA",'$?(exists(@.num)) '
type(strict))

Plan hash value: 3552627291

-----------------------------------------------------------------------------------------------------------------
| Id  | Operation                           | Name            | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
-----------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                    |                 |      1 |        |      6 |00:00:00.01 |       2 |
|   1 |  TABLE ACCESS BY INDEX ROWID BATCHED| test            |      1 |      1 |      6 |00:00:00.01 |       2 |
|   2 |   HASH UNIQUE                       |                 |      1 |      1 |      6 |00:00:00.01 |       1 |
|*  3 |    INDEX RANGE SCAN (MULTI VALUE)   | $ora:test.num_1 |      1 |      1 |      6 |00:00:00.01 |       1 |
-----------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   3 - access("test"."SYS_NC00005$">HEXTORAW('01'))

Note
-----
   - dynamic statistics used: dynamic sampling (level=2)


26 rows selected.

The index range scan returned 6 entries, as if it were a sparse index. We cannot create a sparse index on Oracle Database:

ora> db.test.dropIndex({ num: 1 })
{ nIndexesWas: 2, ok: 1 }

ora> db.test.createIndex({ num: 1 } , { sparse: 1 })
MongoServerError[MONGO-67]: Unsupported index option: sparse

Amazon DocumentDB (AWS)

Amazon DocumentDB speaks the MongoDB wire protocol but is built on a completely different architecture. The storage layer is distributed like Aurora, replicated across three availability zones. The query planner and storage engine are specific to Amazon DocumentDB and deliver performance characteristics that differ from both MongoDB and standard PostgreSQL.

Non-sparse index on Amazon DocumentDB

The { num: { $exists: true } } query does not use the non-sparse index (created as db.test.createIndex({ num: 1 })) on Amazon DocumentDB (tested on version 8, planner version 3):

queryPlanner: {
  plannerVersion: 3,
  winningPlan: { stage: 'COLLSCAN', filter: { num: { '$exists': true } } }
},
executionStats: {
  nReturned: '6',
  executionTimeMillis: '14.121',
  planningTimeMillis: '14.019',
  executionStages: {
    stage: 'COLLSCAN',
    nReturned: '6',
    executionTimeMillisEstimate: '0.025'
  }
}

The index is completely abandoned. The planner chose a full collection scan.

Sparse index on Amazon DocumentDB

With the index created as db.test.createIndex({ num: 1 }, { sparse: true }), the index is used:

queryPlanner: {
  plannerVersion: 3,
  winningPlan: { stage: 'IXSCAN', indexName: 'num_1', direction: 'forward' }
},
executionStats: {
  nReturned: '6',
  executionTimeMillis: '10.034',
  planningTimeMillis: '8.128',
  executionStages: {
    stage: 'IXSCAN',
    nReturned: '6',
    executionTimeMillisEstimate: '1.842',
    indexName: 'num_1',
    direction: 'forward'
  }
}

Every entry in the sparse index provably satisfies $exists: true. It scans 6 index entries and returns 6 documents. In MongoDB, a sparse index is an optimization for this query. In Amazon DocumentDB, it is required for the query to use an index at all.

Microsoft DocumentDB on PostgreSQL

Microsoft DocumentDB is implemented as an open-source PostgreSQL extension, accessed via the MongoDB wire protocol through a compatible endpoint.

With DocumentDB on PostgreSQL, a sparse index is not required for an optimal access path. I created the index as db.test.createIndex({ num: 1 }) and used a hint to force the index, since on a small collection the cost-based planner would otherwise prefer a sequential scan:

db.test.find(
  { num: { $exists: true } }
).hint("num_1").explain("executionStats")

(Be careful when using a hint with MongoDB queries: hinting a sparse or partial index limits the scan to what is indexed, which may change the result.)
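To make the caution about hints concrete, here is a small model of what happens when a scan is restricted to a sparse index for the filter { num: null }. It is not MongoDB output, only a simulation on the same eight documents, and it assumes the sparse index simply has no entry for documents without the field.

```javascript
const docs = [
  { _id: 1, num: 42 }, { _id: 2, num: 7 }, { _id: 3, num: null }, { _id: 4 },
  { _id: 5, num: 99 }, { _id: 6 }, { _id: 7, num: null }, { _id: 8, num: 15 },
];

// { num: null } matches explicit nulls and missing fields.
const matchesNull = (doc) => !("num" in doc) || doc.num === null;

// Collection scan: every document is visited.
const viaCollscan = docs.filter(matchesNull).map((d) => d._id);

// Forced sparse index scan: only documents that have the field are reachable.
const reachableFromSparseIndex = docs.filter((d) => "num" in d);
const viaSparseHint = reachableFromSparseIndex.filter(matchesNull).map((d) => d._id);

console.log({ viaCollscan, viaSparseHint });
```

The forced path returns only the explicit nulls, because the documents with a missing field are never visited. A planner left to choose on its own would avoid such an index for this filter.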

The execution plan reads only the necessary entries from the index:

executionStats: {
  nReturned: Long('6'),
  executionTimeMillis: 0.093,
  executionStartAtTimeMillis: 0.089,
  totalDocsExamined: Long('6'),
  totalKeysExamined: Long('6'),
  executionStages: {
    stage: 'FETCH',
    nReturned: Long('6'),
    executionTimeMillis: 0.093,
    executionStartAtTimeMillis: 0.089,
    totalKeysExamined: 6,
    numBlocksFromCache: 24,
    inputStage: {
      stage: 'IXSCAN',
      nReturned: Long('6'),
      executionTimeMillis: 0.093,
      executionStartAtTimeMillis: 0.089,
      indexName: 'num_1',
      totalKeysExamined: 6,
      numBlocksFromCache: 24
    }
  }
}

It shows the same count for index entries (totalKeysExamined: 6) and documents fetched (totalDocsExamined: 6).

Here we can go further and bypass the MongoDB layer entirely, querying PostgreSQL directly to see exactly what the database engine sees.

To understand why no residual filter is needed, we can first look at the underlying implementation in the PostgreSQL catalog:

\d documentdb_data.documents_15

           Table "documentdb_data.documents_15"
     Column      |  Type  | Collation | Nullable | Default
-----------------+--------+-----------+----------+---------
 shard_key_value | bigint |           | not null |
 object_id       | bson   |           | not null |
 document        | bson   |           | not null |
Indexes:
    "collection_pk_15" PRIMARY KEY, btree (shard_key_value, object_id)
    "documents_rum_index_47" documentdb_extended_rum
        (document bson_extended_rum_composite_path_ops
         (pathspec='[ "num" ]', tl='2691'))

There are no individual columns for num, name, or any other document field. The entire document is stored as a single bson blob in the document column. PostgreSQL has no native knowledge of what is inside it. The collection name test maps to documents_15, where 15 is the collection's internal identifier.

The index is not a standard PostgreSQL B-tree. It is an Extended RUM index — documentdb_extended_rum — with a custom operator class: bson_extended_rum_composite_path_ops. RUM is an extension of GIN (Generalized Inverted Index) that adds support for ordering, range scans, and additional per-entry metadata. The operator class is the critical piece: it knows how to extract the num field from the opaque BSON blob and store it in a structure PostgreSQL can search. pathspec='[ "num" ]' tells it which field to index.

We can obtain the PostgreSQL execution plan directly using the DocumentDB API. I disabled sequential scans to override the cost-based planner's preference on this small table:

postgres=# set enable_seqscan to off;
           explain (analyze, buffers, verbose, costs off)
select document from bson_aggregation_find(
  'test',
  '{
    "find": "test",
    "filter": { "num": { "$exists": true } }
  }'::documentdb_core.bson
);
                                                 QUERY PLAN
-------------------------------------------------------------------------------------------------------------
 Index Scan using num_1 on documentdb_data.documents_15 collection (actual time=0.036..0.040 rows=6 loops=1)
   Output: document
   Index Cond: (collection.document @>= '{ "num" : { "$minKey" : 1 } }'::bson)
   Buffers: shared hit=3
 Planning:
   Buffers: shared hit=26
 Planning Time: 0.416 ms
 Execution Time: 0.052 ms
(8 rows)

The $exists: true predicate has been translated into a PostgreSQL index condition: document @>= '{ "num": { "$minKey": 1 } }'::bson. This uses a custom BSON operator @>= meaning "document has field num with a value greater than or equal to MinKey."

MinKey is a special BSON sentinel value that sits below every other BSON value in the type ordering. The condition @>= MinKey therefore means "field num exists and has any BSON value at all" — which is exactly $exists: true. Existence becomes a range scan from the minimum possible value: an elegant encoding.
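The idea of expressing existence as a range condition can be sketched with a toy comparator. The type ranks below cover only a subset of the BSON comparison order (MinKey, null, numbers, strings, MaxKey), and MIN_KEY, bsonType and compareBson are stand-ins written for this illustration, not the extension's operators.

```javascript
// Subset of the BSON comparison order, lowest first (assumed sufficient for this example).
const TYPE_RANK = { minKey: 0, null: 1, number: 2, string: 3, maxKey: 4 };
const MIN_KEY = { sentinel: "minKey" };
const MAX_KEY = { sentinel: "maxKey" };

function bsonType(value) {
  if (value === MIN_KEY) return "minKey";
  if (value === MAX_KEY) return "maxKey";
  if (value === null) return "null";
  if (typeof value === "number") return "number";
  if (typeof value === "string") return "string";
  throw new Error("type not modeled in this sketch");
}

// Compare by type rank first, then by value within the same type.
function compareBson(a, b) {
  const rankDiff = TYPE_RANK[bsonType(a)] - TYPE_RANK[bsonType(b)];
  if (rankDiff !== 0) return rankDiff;
  if (typeof a === "number") return a - b;
  if (typeof a === "string") return a < b ? -1 : a > b ? 1 : 0;
  return 0; // null vs null, or the same sentinel
}

const docs = [
  { _id: 1, num: 42 }, { _id: 2, num: 7 }, { _id: 3, num: null }, { _id: 4 },
  { _id: 5, num: 99 }, { _id: 6 }, { _id: 7, num: null }, { _id: 8, num: 15 },
];

// "num >= MinKey" holds for every value that is present; a missing field has nothing to compare.
const existsAsRange = (doc) => "num" in doc && compareBson(doc.num, MIN_KEY) >= 0;
const matched = docs.filter(existsAsRange).map((d) => d._id);
console.log(matched);
```

Because null ranks above MinKey, explicit nulls fall inside the range, while documents without the field never produce a value to compare.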

The RUM index is path-based and creates entries for the paths that actually exist in documents. However, documents where num is absent also have an index entry, which is scanned by the opposite filter { num: { $exists: false } }:

postgres=# explain (analyze, buffers, verbose, costs off)
select document from bson_aggregation_find(
  'test',
  '{
    "find": "test",
    "filter": { "num": { "$exists": false } }
  }'::documentdb_core.bson
);
                                                 QUERY PLAN
-------------------------------------------------------------------------------------------------------------
 Index Scan using num_1 on documentdb_data.documents_15 collection (actual time=0.030..0.033 rows=2 loops=1)
   Output: document
   Index Cond: (collection.document @? '{ "num" : false }'::bson)
   Buffers: shared hit=3
 Planning:
   Buffers: shared hit=80
 Planning Time: 0.282 ms
 Execution Time: 0.070 ms
(8 rows)

This has read the two rows (rows=2) without the num field.

The complete picture

Here is the summary for { $exists: true } queries on the example above:

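                                     | Non-sparse index                                                  | Sparse index
-------------------------------------+-------------------------------------------------------------------+----------------------------------------------------
MongoDB                              | FETCH ← IXSCAN, 8 keys, 8 docs, residual filter, 2 wasted fetches | FETCH ← IXSCAN, 6 keys, 6 docs, no residual filter
Amazon DocumentDB (AWS)              | COLLSCAN, no index, 8 docs                                        | IXSCAN, 6 keys
DocumentDB on PostgreSQL (Microsoft) | FETCH ← IXSCAN, 6 keys, 6 docs, no residua... (truncated)         |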

May 28, 2026

Percona Operator for PostgreSQL 3.0.0: Hard Fork, OLM Scoping, Major Upgrades

The Percona Operator for PostgreSQL 3.0.0 is here. This is the release that completes the hard fork of the operator from the Crunchy Data PostgreSQL Operator into a fully independent project, with a dedicated upstream.pgv2.percona.com API group for the inherited CRDs, an automatic CRD-rename rollout for existing 2.x installs on upgrade, and a public roadmap …

The post Percona Operator for PostgreSQL 3.0.0: Hard Fork, OLM Scoping, Major Upgrades appeared first on Percona.

Guide your Amazon Aurora MySQL migration with Kiro powers

Today, we announce the Amazon Aurora MySQL power for Kiro. The power connects Kiro’s AI agent to Aurora MySQL and pairs live database access with curated best-practice guidance. You describe what you need in natural language. The agent generates the API calls, SQL, and configuration for you to review and run. In this post, we walk through how the power guides a production migration from Amazon Relational Database Service (Amazon RDS) for MySQL 8.0 to Aurora MySQL through four phases: assessment, replica creation, promotion, and post-cutover validation.

May 27, 2026

Optimize costs in Amazon Aurora

By implementing modern optimization techniques for Aurora, you can achieve additional cost reduction beyond traditional methods alone. This isn’t only about spending less—it’s about building a more efficient, scalable, and resilient database environment. In this post, we show you a structured approach to optimizing Amazon Aurora database costs. It outlines specific strategies, implementation steps, and best practices across different optimization areas.

Migrate from Crunchy Data PostgreSQL Operator to Percona PostgreSQL Operator: Backup-Restore and PV Reuse

A Percona PostgreSQL operator pgBackRest restore is the simplest way to move off the Crunchy Data PostgreSQL Operator: take a full Crunchy backup, point the new Percona cluster’s dataSource at the existing pgBackRest archive, and the cluster bootstraps from it before its first start. This post covers that path, plus a second option, persistent-volume reuse, for cases …

The post Migrate from Crunchy Data PostgreSQL Operator to Percona PostgreSQL Operator: Backup-Restore and PV Reuse appeared first on Percona.

CedarDB: Features of April 2026

This post takes a closer look at some of the most impactful features we have shipped in CedarDB across our recent releases. Whether you have been following along closely or are just catching up, here is a deeper look at the additions we are most excited about.

Set-Returning Functions: Lock-Step Evaluation

v2026-04-20

When handling bulk data transformations or speeding up database inserts, a popular developer trick is to use multiple set-returning functions side-by-side in the SELECT clause to “zip” arrays together into individual rows. PostgreSQL changed its behavior here in version 10. Earlier versions returned a number of rows equal to the least common multiple of the set sizes, which degenerated into a full cross-product when the sizes shared no common factor. Since version 10, the functions are evaluated in lock-step (see the PostgreSQL 10 Release Notes).

To guarantee seamless compatibility and keep your queries lightning-fast, CedarDB evaluates multiple set-returning functions in the SELECT list in the exact same lock-step manner.

-- Zipping up arrays in lock-step to quickly generate rows
SELECT UNNEST('{alice, bob, charlie}'::TEXT[]) AS user_name, 
 UNNEST('{active, inactive, active}'::TEXT[]) AS status, 
 UNNEST('{150, 200, 350}'::INT[]) AS score;

Rather than the 27 rows that a cross join of the three arrays would produce, CedarDB steps through the arrays row by row and returns exactly 3 paired rows. If you rely on array unnesting to batch your application’s database inserts, you get scalable performance and the same behavior as modern PostgreSQL.

Note: This lock-step evaluation applies to other set-returning functions you might already know! Alongside UNNEST, you can use functions like generate_series, json_array_elements, or regexp_matches to efficiently generate and zip your data.
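To contrast lock-step evaluation with a full cross join of the same arrays, here is a small model in JavaScript. The lockStep and crossProduct functions are written for this illustration only. Padding shorter sets with null follows the behavior documented for PostgreSQL 10 and later; that CedarDB pads the same way is an assumption here.

```javascript
// Lock-step: row i takes the i-th element of every set; exhausted sets yield null.
function lockStep(...sets) {
  const rowCount = Math.max(...sets.map((s) => s.length));
  return Array.from({ length: rowCount }, (_, i) =>
    sets.map((s) => (i < s.length ? s[i] : null)));
}

// Cross join: every combination of one element from each set.
function crossProduct(...sets) {
  return sets.reduce(
    (rows, set) => rows.flatMap((row) => set.map((value) => [...row, value])),
    [[]]);
}

const names = ["alice", "bob", "charlie"];
const statuses = ["active", "inactive", "active"];
const scores = [150, 200, 350];

const zipped = lockStep(names, statuses, scores);
const crossed = crossProduct(names, statuses, scores);
const padded = lockStep([1, 2, 3], ["a"]);

console.log(zipped.length, crossed.length);
console.log(padded);
```

For the three arrays of the example, the lock-step result has 3 rows and the cross join has 27. The padded case shows what happens when the sets have different lengths.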

ON UPDATE CASCADE: Keep Your Data in Sync Automatically

v2026-04-20

Changing core identifiers, like a user’s handle or a department code, used to mean manually updating every referencing row to avoid breaking foreign key constraints. To make your life easier, CedarDB now supports ON UPDATE CASCADE. Just add this clause to your foreign key, and CedarDB will automatically propagate updates from the parent table directly to its child tables.

Say you have a platform where posts reference an author’s username. If an author changes their handle, a single UPDATE handles the rest:

CREATE TABLE authors (username TEXT PRIMARY KEY);

CREATE TABLE posts (
 post_id INTEGER PRIMARY KEY,
 created_at TIMESTAMP,
 author_username TEXT REFERENCES authors(username) ON UPDATE CASCADE
);

-- Updating 'alice' to 'alice_smith' automatically updates all her posts!
UPDATE authors SET username = 'alice_smith' WHERE username = 'alice';

Note: To guarantee predictable performance and prevent runaway loops, CedarDB currently limits this to single-level cascades. An auto-updated column cannot act as the trigger for another cascade into a third table. CedarDB validates this at table creation time, so your schema stays consistent and performant.

pg_stat_database and pg_stat_activity: Observability Out of the Box

v2026-04-20

CedarDB now implements pg_stat_database and pg_stat_activity, two of Postgres’ most widely used monitoring views. This means your existing observability stack (pgAdmin, Datadog, or any custom dashboard that speaks Postgres) just works with CedarDB, no changes required.

pg_stat_activity gives you a live window into what your database is doing right now: active queries, connection states, and client details. Spot long-running idle transactions that are holding locks or causing WAL bloat:

SELECT pid, usename, state, xact_start, now() - xact_start AS xact_age
FROM pg_stat_activity
WHERE state = 'idle in transaction'
 AND now() - xact_start > interval '5 minutes';

pg_stat_database complements this with per-database aggregate statistics: transactions committed and rolled back, cache hit rates, tuples returned, and more. To check your database health at a glance:

SELECT datname,
 blks_hit::float / nullif(blks_hit + blks_read, 0) AS cache_hit_ratio,
 xact_commit,
 xact_rollback
FROM pg_stat_database
WHERE datname = current_database();

VACUUM (TRUNCATE): Release Disk Space Back to the OS

v2026-04-20

CedarDB’s storage footprint grows as your data grows, but until now, the main storage file never shrank. Dropped indexes, truncated tables, and deleted data all freed up pages internally, but the underlying file stayed the same size on disk. In some cases this could leave you with a much larger file than your actual data warrants, for example after building and then dropping a large index, or after ALTER operations that rewrite a table.

VACUUM (TRUNCATE) addresses the most straightforward case: if there are unused pages at the end of the storage file, CedarDB will truncate the file and return that space to the OS.

-- After dropping a large index or table, reclaim the trailing space
VACUUM (TRUNCATE);

CedarDB also now properly returns pages to the free pool after ALTER TABLE and ALTER INDEX statements, making them eligible for truncation. More comprehensive shrinking behavior, covering space freed in the middle of the file, will follow in future releases.

Note: Only trailing unused pages can be released to the OS today. Freed space in the middle of the file is currently retained for reuse by future writes.
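The trailing-pages rule can be pictured with a toy model of the storage file as an array of pages. This is only an illustration of the rule stated above, not CedarDB's implementation. The truncateTrailing function and the boolean page map are assumptions made for the sketch.

```javascript
// Each entry is one page: true = in use, false = free.
function truncateTrailing(pages) {
  let end = pages.length;
  // Walk back from the end of the file while the last page is free.
  while (end > 0 && !pages[end - 1]) end--;
  return pages.slice(0, end);
}

const before = [true, false, false, true, false, false, false];
const after = truncateTrailing(before);

const releasedToOs = before.length - after.length;                // trailing free pages
const retainedFreePages = after.filter((inUse) => !inUse).length; // free pages in the middle

console.log({ after, releasedToOs, retainedFreePages });
```

In this example three trailing pages go back to the OS, while the two free pages in the middle stay in the file for reuse by future writes.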

json_agg and json_build_array: JSON Aggregation in SQL

v2026-04-27

Two commonly used JSON functions are now available in CedarDB: the aggregate json_agg and the array constructor json_build_array.

json_agg aggregates rows into a JSON array, which makes it straightforward to produce nested JSON results directly from a query. This is useful for building API responses or feeding data to applications that expect JSON without an extra serialization step:

-- Return each author with a JSON array of their post titles
SELECT a.username,
 json_agg(p.title ORDER BY p.created_at DESC) AS recent_posts
FROM authors a
JOIN posts p ON p.author_username = a.username
GROUP BY a.username;

json_build_array lets you construct a JSON array from explicit values or column references in a single row:

SELECT json_build_array(user_id, username, email) AS user_tuple
FROM users
LIMIT 5;

Together, these two functions cover the most common patterns for producing JSON output directly in SQL, without needing to post-process results in application code.


May 26, 2026

Announcing VillageSQL Server 0.0.4

Explore VillageSQL Server 0.0.4: now featuring VEF v3, custom aggregates, parameter inference, and preview capabilities like background threads.

The Autovacuum Scale Factor Problem at Scale - Know Your Defaults

In PostgreSQL, autovacuum and autoanalyze exist to clean up dead tuples (old versions of updated/deleted rows) and update query planner statistics, respectively. The challenge is running them frequently enough so that query plans and execution do not degrade after data modifications, but not so frequently as to cause excessive I/O overhead.

Databases often maintain a counter of the number of modifications to trigger these background jobs. Oracle Database and MySQL use a stale percentage (the ratio of modifications to total rows) for statistics gathering. SQL Server uses a dynamically decreasing percentage to ensure statistics do not remain stale for too long on massive tables. PostgreSQL uses a hybrid approach: a fixed base threshold combined with a scale factor (a percentage) that grows proportionally with the table size.

This hybrid approach hits the sweet spot for most workloads, but it often requires tuning based on your specific data. The key factor to watch is the amount of static, "cold" data in your tables. Because the scale factor is calculated against the total table size, a large volume of cold data will significantly inflate the threshold. This can delay maintenance on the active working set—the "hot" data actually used by your queries—leaving it vulnerable to stale statistics or bloat.

Here are the default base thresholds:

postgres=# \dconfig *autovacuum*threshold
        List of configuration parameters
             Parameter              |   Value
------------------------------------+-----------
 autovacuum_analyze_threshold       | 50
 autovacuum_vacuum_insert_threshold | 1000
 autovacuum_vacuum_max_threshold    | 100000000
 autovacuum_vacuum_threshold        | 50
(4 rows)

At first glance, this suggests tables are analyzed when 50 rows are modified, and vacuumed when 50 dead tuples accumulate (from deletes or updates) or 1,000 rows are inserted. But that would only be true with a scale factor of zero. The defaults are 10% for statistics and 20% for vacuum:

postgres=# \dconfig *autovacuum*scale_factor
       List of configuration parameters
               Parameter               | Value
---------------------------------------+-------
 autovacuum_analyze_scale_factor       | 0.1
 autovacuum_vacuum_insert_scale_factor | 0.2
 autovacuum_vacuum_scale_factor        | 0.2
(3 rows)

Because of the scale factor, the actual trigger thresholds increase with the size of the table. For the default settings, the formulas are:

  • Analyze when inserts or modifications > 10% table row count plus 50 rows
  • Vacuum when dead tuples (from DELETE or UPDATE) > 20% table row count plus 50 rows
  • Vacuum when inserts > 20% table row count plus 1000 rows
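The three formulas above are simple enough to evaluate for any table size. The sketch below hard-codes the default values from the two listings. The autovacuumTriggers function and its property names are made up for the illustration, and the cap from autovacuum_vacuum_max_threshold is deliberately ignored.

```javascript
// Default settings, copied from the \dconfig listings above.
const DEFAULTS = {
  analyzeThreshold: 50, analyzeScaleFactor: 0.1,
  vacuumThreshold: 50, vacuumScaleFactor: 0.2,
  insertThreshold: 1000, insertScaleFactor: 0.2,
};

// Number of changes needed before each maintenance job fires (rounded for display).
function autovacuumTriggers(rowCount, settings = DEFAULTS) {
  const trigger = (base, factor) => Math.round(base + factor * rowCount);
  return {
    analyzeAfterModifications: trigger(settings.analyzeThreshold, settings.analyzeScaleFactor),
    vacuumAfterDeadTuples: trigger(settings.vacuumThreshold, settings.vacuumScaleFactor),
    vacuumAfterInserts: trigger(settings.insertThreshold, settings.insertScaleFactor),
  };
}

const smallTable = autovacuumTriggers(1000);
const largeTable = autovacuumTriggers(5000000);
console.log(smallTable, largeTable);
```

For a table of 1,000 rows the triggers are 150, 250 and 1,200 changes. For 5 million rows they grow to 500,050, 1,000,050 and 1,001,000: the fixed base threshold becomes negligible and the scale factor dominates.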

As these formulas show, a larger table requires a much larger accumulation of changes before maintenance fires. This is perfectly acceptable if data churn is uniformly distributed, as small changes across a massive dataset will not drastically impact query cost estimations.

However, data distribution is rarely uniform and evolves over time (e.g., seasonal sales spikes, market expanding to new countries). Because static data inflates the table row count in the formulas above, your database waits too long to trigger maintenance on the active working set.

This is the core problem with default autovacuum settings at scale: a table with 5 million rows can accumulate half a million stale modifications before the planner statistics are refreshed, and over a million dead tuples before bloat is cleaned up. The larger the table grows, the longer it waits, and the worse the situation becomes:

  • Query planner statistics become increasingly stale between analyzes.
  • The visibility map is stale for longer and index-only scans become less efficient.
  • Dead tuple bloat accumulates more between cleanups, wasting storage and slowing scans.

To demonstrate this, I have run the following script to simulate this kind of activity, constantly inserting 100 rows and then updating them. We delete nothing because we want to keep the history, but queries operate on those recent rows. Think of it like orders being entered, then processed, and remaining stored:

\c
\o tmp.log

-- run autovacuum frequently for the demo
alter system set autovacuum_naptime = '1s';
select pg_reload_conf();

-- create a table
drop table if exists demo;
create table demo (
 id bigserial primary key, n int default 0
);
vacuum analyze demo;

-- show the vacuum and analyze statistics,
-- insert 100 rows and update them
-- run that in a loop every 5 seconds
select relname, n_tup_ins, n_tup_upd , n_mod_since_analyze, n_ins_since_vacuum
 , autovacuum_count,  last_autovacuum  --, vacuum_count,  last_vacuum
 , autoanalyze_count, last_autoanalyze --, analyze_count, last_analyze
from pg_stat_user_tables where relid='demo'::regclass
\;
insert into demo select from generate_series(1,100)
\;
update demo set n=n+1 where id in (
select id from demo order by id desc limit 100
)
\watch i=5 c=100000

For each iteration, the total number of rows inserted (n_tup_ins) and updated (n_tup_upd), as visible in pg_stat_user_tables, increases by 100. It is the X-axis on this diagram (n_tup):

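The Y-axis shows the staleness of statistics (n_mod_since_analyze) and the rows waiting for a vacuum (n_ins_since_vacuum, the column collected by the script, used here as a proxy for dead tuples because every inserted row is updated once) until the auto vacuum/analyze kicks in.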

With 5 million rows, the last million inserted rows accumulated dead tuples. That is 20% of the total table, as defined by the default vacuum scale factor, but it likely represents 100% of the data actively read by your queries (for example, if the application processes the last year or less of a 5-year history). Furthermore, the last 500,000 rows (10% of the table, the default analyze scale factor) have completely stale statistics, and the past months may not have the same data distribution as the previous years.

Think about the impact this has on the maximum value for an ID sequence or a created_at timestamp. It also completely skews the query planner's understanding of your data distribution (such as querying by country or day of the week). I have seen this cause severe performance issues in the real world: a retail company where shops only open on Sundays during the summer, or a trading platform suddenly processing entirely new market trends. Because the statistics are stale, the planner assumes your new, active data looks exactly like your old, historical data.

As the table grows, the impact of this bloat and staleness compounds, and performance will no longer scale. Eventually, your execution plans will flip—not because the queries changed, but simply because the estimations of the query planner are completely wrong.

For very large tables where the total size increases but the active working set is a small, predictable number of rows, you can effectively disable the scale factor and rely almost entirely on the fixed threshold:

ALTER TABLE demo SET (
    autovacuum_analyze_scale_factor = 0.001,
    autovacuum_analyze_threshold    = 10000,
    autovacuum_vacuum_scale_factor  = 0.001,
    autovacuum_vacuum_threshold     = 10000
);

This sets a nearly flat threshold that barely grows with the table size: a fixed 10,000 rows plus 0.1% of the row count. The right threshold value depends on how many rows your active working set changes per hour and how much staleness you can tolerate. However, you must monitor the consequences of running autovacuum frequently on a growing table to ensure it does not cause localized I/O spikes.

Here is how the same run starts with the new table settings:

Auto analyze never left more than ~10,000 modified rows without refreshing statistics. This threshold grows slightly with the table (at 10 million rows it doubles to 20,000), but remains vastly better than the default. Auto vacuum follows the same pattern for dead tuples, but runs more frequently at the start because the insert-specific vacuum trigger was left at its default (1,000 rows + 20% scale factor), which stays below the new 10,000-row vacuum threshold until the table grows beyond about 45,000 rows.
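
The numbers above can be checked with a short Python sketch. The helper names are hypothetical; the values mirror the ALTER TABLE settings and the default insert trigger, and the row count is assumed to be exact rather than the planner's estimate.

```python
# Hypothetical helpers: values mirror the ALTER TABLE settings above and the
# default insert-driven vacuum trigger.
def threshold(base, scale_factor, row_count):
    return base + scale_factor * row_count


def custom_threshold(row_count):
    # autovacuum_*_threshold = 10000, autovacuum_*_scale_factor = 0.001
    return threshold(10_000, 0.001, row_count)


def insert_threshold(row_count):
    # autovacuum_vacuum_insert_threshold = 1000 and
    # autovacuum_vacuum_insert_scale_factor = 0.2 (defaults)
    return threshold(1_000, 0.2, row_count)


# Row count n where both triggers are equal: 1000 + 0.2 n = 10000 + 0.001 n
crossover = (10_000 - 1_000) / (0.2 - 0.001)

print(f"custom threshold at 5M rows:  {custom_threshold(5_000_000):,.0f}")
print(f"custom threshold at 10M rows: {custom_threshold(10_000_000):,.0f}")
print(f"insert trigger is the lower one below about {crossover:,.0f} rows")
```

Below the crossover, the insert trigger fires first. Above it, the fixed 10,000-row threshold dominates.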

To address this unbounded growth natively, PostgreSQL 18 introduced autovacuum_vacuum_max_threshold (with a default of 100 million, which is too high for my example). Rather than letting the scale factor dictate an endlessly growing target, this parameter imposes a hard ceiling on the vacuum threshold calculation. Even on a massive 1-billion-row table, where the default formula would wait for more than 200 million dead tuples, autovacuum triggers once dead tuples reach the configured maximum, serving as a built-in safety net. The cap applies to the dead-tuple vacuum threshold only, not to the analyze or insert thresholds. You can adjust it globally with a simple config reload, or set it as a per-table storage parameter to enforce a strict upper limit on dead-tuple accumulation without micromanaging scale factors across your entire schema.
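
As a rough illustration of how such a ceiling changes the arithmetic, here is a minimal Python sketch. The function name and signature are hypothetical, and the handling of a negative value as "no cap" is an assumption modeled on the documented -1 setting.

```python
# Hypothetical sketch of a capped vacuum threshold; not PostgreSQL code.
def vacuum_threshold(row_count, base=50, scale_factor=0.2,
                     max_threshold=100_000_000):
    """Dead tuples needed to trigger autovacuum, with the ceiling applied."""
    uncapped = base + scale_factor * row_count
    if max_threshold >= 0:          # assumption: a negative value disables the cap
        return min(uncapped, max_threshold)
    return uncapped


for rows in (5_000_000, 1_000_000_000):
    default_cap = vacuum_threshold(rows)
    lower_cap = vacuum_threshold(rows, max_threshold=100_000)  # illustrative value
    print(f"{rows:>13,} rows: default cap -> {default_cap:,.0f}, "
          f"cap of 100,000 -> {lower_cap:,.0f}")
```

With the default ceiling, the 5-million-row table from the demo is unaffected, because its threshold of about one million dead tuples is far below 100 million. Only a much lower per-table value would change its behavior.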

Naturally, enforcing stricter thresholds, whether through the new maximum cap or manual table-level tuning, means autovacuum will run more frequently, which demands more background worker capacity. Historically, increasing autovacuum_max_workers to handle this extra load required a full database restart. PostgreSQL 18 splits the setting in two: autovacuum_worker_slots reserves the hard upper bound of backend slots at postmaster startup, while autovacuum_max_workers dictates how many of those slots are actively used. As long as you stay within the reserved slots, you can scale up your active workers on the fly (ALTER SYSTEM SET autovacuum_max_workers = 8; SELECT pg_reload_conf();) to absorb heavy maintenance workloads without incurring any downtime.

Alternatively, if your table has a clear time-based or categorical boundary between hot and cold data, partitioning is worth considering. Autovacuum operates per partition, so a current_year partition with 100,000 rows will trigger maintenance far sooner than a monolithic 5-million-row table, meaning the default scale factor will naturally behave exactly as intended.

May 25, 2026

Running TidesDB as a MySQL 9.7 storage engine

tidesdb-mysql is an experimental build that was developed to verify how TidesDB, the LSM-tree key/value engine, can work with MySQL 9.7 as a storage engine. The current build is v0.2.4, and it’s an experiment, not a finished product. So you can use it in your tests if you also want to try TidesDB with MySQL … Continued

The post Running TidesDB as a MySQL 9.7 storage engine appeared first on Percona.

Migrate from Crunchy Data PostgreSQL Operator to Percona PostgreSQL Operator: Standby Cluster Method

A Crunchy to Percona PostgreSQL migration is more straightforward than most cross-operator moves on Kubernetes, because the Percona PostgreSQL Operator is a hard fork of the Crunchy Data PostgreSQL Operator. Same Patroni HA, same pgBackRest backups, same overall CRD shape. This post walks through the safest of the three migration paths: a standby cluster method … Continued

The post Migrate from Crunchy Data PostgreSQL Operator to Percona PostgreSQL Operator: Standby Cluster Method appeared first on Percona.

May 23, 2026

May 22, 2026

MySQL 9.7.0 PGO Benchmark Analysis

Overview

Servers Tested:

  • MySQL 9.7.0 (PGO-enabled build released by Oracle)
  • MySQL 9.7.0 Non-PGO (built without Profile-Guided Optimization, see BUILD.md)

Tier Configurations:

  • Tier 2G: 2GB InnoDB buffer pool
  • Tier 12G: 12GB InnoDB buffer pool
  • Tier 32G: 32GB InnoDB buffer pool

View Results

Interactive Reports: The benchmark reports are available as interactive HTML pages … Continued

The post MySQL 9.7.0 PGO Benchmark Analysis appeared first on Percona.