a curated list of database news from authoritative sources

May 30, 2026

BSON and OSON: documents are designed to be nested, not flat

I like vendors benchmarks as they are often good illustrations of worst practice. Rather than focusing on real implementation trade-offs, they pick an extreme case that favors their own implementation by chance but is not optimized in the competitor's — because it is simply not how that technology is meant to be used. For example, Oracle published a "2x" benchmark results using YCSB, a key-value benchmark (slide 14 here), and a "529x" test misusing the raw BSON purpose (here). Both have something in common: they compare a relational database against a document database by using neither technology as it was designed to be used — no normalized tables, no nested documents, just flat key-value fields. The latter test is even worse: it uses 1,000 top-level fields in a single document. Don't do that!

While having thousands of columns in an SQL table is usually undesirable, it's acceptable to have hundreds or even thousands of fields in a document, as they represent multiple entities and value objects. However, nobody creates a flat structure with 1000 top-level fields. The advantage of JSON is its ability to organize entities into nested sub-documents. For instance, instead of storing first_name and last_name as separate fields, you can have a name field containing a sub-object with first and last. During queries, using dot notation to reference name.first and name.last makes no difference with first_name or last_name. It simplifies reading and displaying the document, and, as it is how it should be used, the binary JSON formats are optimized for it.

To illustrate this, I create two collections, flat with one thousand fields at the top level, and nest with ten top-level fields, each containing a sub-object with ten sub-objects.


{
echo field{0..9}{0..9}{0..9}
for i in {1..100000} ; do echo {0..9}{0..9}{0..9} ; done
} | sed 's/ /,/g' |
 mongoimport --collection "flat" --type=csv --headerline --drop

{
echo field{0..9}.sub{0..9}.sub{0..9}
for i in {1..100000} ; do echo {0..9}{0..9}{0..9} ; done
} | sed 's/ /,/g' |
 mongoimport --collection "nest" --type=csv --headerline --drop

The flat documents are like:

  { _id: ObjectId('6a1a1ed41d55219677a79c22'),
  field000: 0, field001: 1, field002: 2, field003: 3, field004: 4, field005: 5, field006: 6, field007: 7, field008: 8, field009: 9,
  field010: 10, field011: 11, field012: 12, field013: 13, field014: 14, field015: 15, field016: 16, field017: 17, field018: 18, field019: 19,
  field020: 20, field021: 21, field022: 22, field023: 23, field024: 24, field025: 25, field026: 26, field027: 27, field028: 28, field029: 29,
  field030: 30, field031: 31, field032: 32, field033: 33, field034: 34, field035: 35, field036: 36, field037: 37, field038: 38,...

The nested documents are like:

  {                                                                                                                                                                                               
    _id: ObjectId('6a1a160041c5538a7e93a9c5'),
    field0: {
      sub0: { sub0: 0, sub1: 1, sub2: 2, sub3: 3, sub4: 4, sub5: 5, sub6: 6, sub7: 7, sub8: 8, sub9: 9 },
      sub1: { sub0: 10, sub1: 11, sub2: 12, sub3: 13, sub4: 14, sub5: 15, sub6: 16, sub7: 17, sub8: 18, sub9: 19 },
      sub2: { sub0: 20, sub1: 21, sub2: 22, sub3: 23, sub4: 24, sub5: 25, sub6: 26, sub7: 27, sub8: 28, sub9: 29 },
      sub3: { sub0: 30, sub1: 31, sub2: 32, sub3: 33, sub4: 34, sub5: 35, ...

BSON format (https://bsonspec.org/)

BSON is designed for efficient sequential scanning through network or disk access. Each field is stored one after another, with a small header containing the field type and name, followed by the value. Because every value type has either a fixed size (like a 64-bit integer or a double) or an explicit length prefix (like a string or a binary blob), the parser can skip over any field it doesn't care about without reading its value at all — it simply jumps forward by the declared length.

Nested sub-documents and arrays are stored the same way: the type is object or array, followed by the total byte length of the entire nested structure, followed by the nested fields themselves (for arrays, the field names are simply "0", "1", "2", and so on). That length prefix is the key optimization: if the parser is looking for a top-level field and the current field is a sub-document that doesn't match, it can skip the entire sub-document in one jump — all its nested fields, however deep — without reading a single byte inside it.

When you query a field using dot notation, such as "name.first", the parser works level by level. It scans the top-level fields looking for name. Any top-level field that doesn't match name is skipped, including any sub-documents, in a single jump. Once name is found and confirmed to be of type object, the parser steps into that sub-document and begins scanning its fields for first, again skipping anything that doesn't match. It never needs to read the sibling sub-documents of name at all.

The worst case is a field that doesn't exist: the parser must scan every field at the relevant nesting level before concluding it isn't there. But even then, sub-documents and arrays that are clearly irrelevant — because their parent name doesn't match — are each skipped in a single operation regardless of how large or deeply nested they are.

This is precisely why organizing many fields into a nested hierarchy is not just a cosmetic preference. A flat document with 1,000 top-level fields forces the parser to read 1,000 field names in the worst case. A document with 10 top-level fields, each containing 10 sub-fields, each containing 10 values, holds the same 1,000 values but the parser only ever reads at most 10 + 10 + 10 = 30 field names to locate any one of them. The BSON length prefix turns the nested structure into an implicit index, making the format genuinely faster to navigate at scale.

With the collection flat that has one thousand top-level fields from "field000" to "field999", looking for the last field "field999" takes more than one second:


db.flat.find(
 { "field999" : {$lt:42} }
).explain("executionStats").executionStats
;

{
  executionSuccess: true,
  nReturned: 0,
  executionTimeMillis: 1081,
  totalKeysExamined: 0,
  totalDocsExamined: 100000,
  executionStages: {
    isCached: false,
    stage: 'COLLSCAN',
    filter: { field999: { '$lt': 42 } },
    nReturned: 0,
    executionTimeMillisEstimate: 1070,
    works: 100001,
    advanced: 0,
    needTime: 100000,
    needYield: 0,
    saveState: 64,
    restoreState: 64,
    isEOF: 1,
    direction: 'forward',
    docsExamined: 100000
  }
}

With the collection nest that has ten top-level fields from "field0" to "field9", and two levels of ten sub-objects from "sub0" to "sub9", looking for the last field "field9.sub9.sub9" is two times faster:


db.nest.find(
 { "field9.sub9.sub9" : {$lt:42} }
).explain("executionStats").executionStats
;

{
  executionSuccess: true,
  nReturned: 0,
  executionTimeMillis: 424,
  totalKeysExamined: 0,
  totalDocsExamined: 100000,
  executionStages: {
    isCached: false,
    stage: 'COLLSCAN',
    filter: { 'field9.sub9.sub9': { '$lt': 42 } },
    nReturned: 0,
    executionTimeMillisEstimate: 420,
    works: 100001,
    advanced: 0,
    needTime: 100000,
    needYield: 0,
    saveState: 21,
    restoreState: 21,
    isEOF: 1,
    direction: 'forward',
    docsExamined: 100000
  }
}

A major part of the time is fetching the document from storage. Navigating into it is a smaller part, but the difference is still clearly visible: 1,081 ms versus 424 ms for 100,000 documents, with the same number of values, and the same worst-case field to find: the last one. The only difference is how those fields are arranged. Nesting them three levels deep, with ten branches at each level, makes the collection scan faster — with no index, no schema change, and no query rewrite beyond the dot notation that you would use anyway.

This is not a micro-optimization or an edge case. Any collection scan, whether triggered by a missing index, a low-selectivity filter, or a background analytics query, pays this cost on every document it reads. The deeper and wider your documents are, the more the BSON length-prefix trick pays off, because the parser can leap over entire branches of the document tree in a single bounds check.

What about Oracle's OSON format?

Oracle Database stores JSON documents in its proprietary binary format called OSON. At first glance, OSON appears to take a fundamentally different approach: rather than storing field names inline with each value as BSON does, OSON builds a field name dictionary at the beginning of the document. Think of it like a mini-datastore with it's catalog and indexes, rather than a protocol buffer. Each field in the document body then refers to its name by a short numeric ID rather than repeating the full string. This makes individual field names cheaper to compare and reduces document size when the same field name appears many times.

Given that design, you might expect nesting to make no difference in Oracle: if every field is just a numeric ID anyway, scanning 1,000 flat fields should cost the same as scanning 10 + 10 + 10 fields spread across three levels. The dictionary lookup cost is the same either way.

In practice, however, the same experiment on Oracle shows a very similar result to MongoDB.

Querying field999 on the flat table takes about 1.6 seconds:


set autotrace traceonly

select /*+ gather_plan_statistics */ "DATA" from "ORA"."flat"
 where JSON_EXISTS("DATA",'$?(@.field999.numberOnly() < $B0)'
 passing 42 as "B0" type(strict))
;

PLAN_TABLE_OUTPUT
________________________________________________________________________________________________
SQL_ID  1h98k751w7dws, child number 0
-------------------------------------
select /*+ gather_plan_statistics */ "DATA" from "ORA"."flat"  where
JSON_EXISTS("DATA",'$?(@.field999.numberOnly() < $B0)'  passing 42 as
"B0" type(strict))

Plan hash value: 4073748891
---------------------------------------------------------------------------------------------
| Id  | Operation         | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers | Reads  |
---------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |      1 |        |      0 |00:00:01.60 |     301K|    739 |
|*  1 |  TABLE ACCESS FULL| flat |      1 |     12 |      0 |00:00:01.60 |     301K|    739 |
---------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter(JSON_EXISTS2( /*+ QJSNMD_USE_SHR_XCTX */ "DATA" /*+ LOB_BY_VALUE */
              FORMAT OSON , '$?(@.field999.numberOnly() < $B0)' /* json_path_str
              $?(@.field999.numberOnly() < $B0)  */  PASSING 42 AS "B0" FALSE ON ERROR
              TYPE(STRICT) )=1)

Note
-----
   - dynamic statistics used: dynamic sampling (level=2)


Statistics
-----------------------------------------------------------
             117  CPU used by this session
             117  CPU used when call started
             160  DB time
         1174164  RM usage by this session
               3  Requests to/from client
             738  Session total flash IO requests
         6053888  cell physical IO interconnect bytes
          301074  consistent gets
          301074  consistent gets from cache
          301074  consistent gets pin
            1068  consistent gets pin (fastpath)
             739  gcs data block access records
               1  messages sent
             747  non-idle wait count
              48  non-idle wait time
               2  opened cursors cumulative
               1  opened cursors current
             738  physical read total IO requests
         6053888  physical read total bytes
               2  process last non-idle time
          301074  session logical reads
              48  user I/O wait time
               4  user calls

Querying field9.sub9.sub9 on the nested table takes about 1.0 second:


select /*+ gather_plan_statistics */ "DATA" from "ORA"."nest"
 where JSON_EXISTS("DATA",'$?(@.field9.sub9.sub9.numberOnly() < $B0)'
 passing 42 as "B0" type(strict))
;

PLAN_TABLE_OUTPUT
__________________________________________________________________________________________________
SQL_ID  7yuwrup6rscrc, child number 0
-------------------------------------
select /*+ gather_plan_statistics */ "DATA" from "ORA"."nest"   where
JSON_EXISTS("DATA",'$?(@.field9.sub9.sub9.numberOnly() < $B0)'
passing 42 as "B0" type(strict))

Plan hash value: 3225864993
---------------------------------------------------------------------------------------------
| Id  | Operation         | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers | Reads  |
---------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |      1 |        |      0 |00:00:01.04 |     100K|   1143 |
|*  1 |  TABLE ACCESS FULL| nest |      1 |   1000 |      0 |00:00:01.04 |     100K|   1143 |
---------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter(JSON_EXISTS2( /*+ QJSNMD_USE_SHR_XCTX */ "DATA" /*+ LOB_BY_VALUE */
              FORMAT OSON , '$?(@.field9.sub9.sub9.numberOnly() < $B0)' /* json_path_str
              $?(@.field9.sub9.sub9.numberOnly() < $B0)  */  PASSING 42 AS "B0" FALSE ON ERROR
              TYPE(STRICT) )=1)


Statistics
-----------------------------------------------------------
              40  CPU used by this session
              40  CPU used when call started
             106  DB time
          404934  RM usage by this session
               4  Requests to/from client
            1125  Session total flash IO requests
         9363456  cell physical IO interconnect bytes
          100221  consistent gets
          100221  consistent gets from cache
          100221  consistent gets pin
           99096  consistent gets pin (fastpath)
            1142  gcs data block access records
             158  global enqueue gets sync
             158  global enqueue releases
            1134  non-idle wait count
              71  non-idle wait time
               2  opened cursors cumulative
               1  opened cursors current
            1125  physical read total IO requests
         9363456  physical read total bytes
              61  process last non-idle time
          100221  session logical reads
              71  user I/O wait time
               4  user calls

The most telling number is not the execution time but the Buffers column: 301,074 consistent gets for the flat table versus 100,221 for the nested table. Oracle is reading three times as many data blocks for the flat documents. To understand why, I check the actual segment sizes:


with
table_seg (owner,table_name, table_blocks) as ( select owner, segment_name, sum(blocks) from dba_segments group by owner,segment_name),
lob_seg   (owner,table_name,   lob_blocks) as ( select owner, table_name,   sum(blocks) from dba_lobs join dba_segments using (owner, segment_name) group by owner,table_name),
tab_stats (owner,table_name,  avg_row_len) as ( select owner, table_name,   avg_row_len from dba_tables)
select owner, table_name, table_blocks, nvl(lob_blocks, 0) as lob_blocks, table_blocks + nvl(lob_blocks, 0) as total_blocks, avg_row_len
 from tab_stats natural join table_seg natural left join lob_seg
 where owner = 'ORA' and table_name in ('flat', 'nest')
;

OWNER    TABLE_NAME       TABLE_BLOCKS    LOB_BLOCKS    TOTAL_BLOCKS    AVG_ROW_LEN
________ _____________ _______________ _____________ _______________ ______________
ORA      flat                    1,152       317,608         318,760            166
ORA      nest                  100,992            32         101,024          7,724

The segment query reveals the full story. The nested documents have an average row length of 7,724 bytes and are stored entirely inline in the table segment, with almost no LOB blocks. The flat documents have an average row length of only 166 bytes in the table segment, but they spill into 317,608 LOB blocks. That tiny average row length for flat is not a sign of small documents — it is the sign of a pointer. The actual OSON bytes have been pushed out of the row into a separate LOB segment managed by Oracle's securefile infrastructure, and every document access requires an additional pointer dereference to fetch them from there.

The root cause is the OSON dictionary. OSON stores each distinct field name only once in a per-document dictionary and replaces every occurrence in the document body with a short numeric ID. The dictionary itself is the mechanism that compresses the field names. A flat document with 1,000 entirely unique field names like field000 through field999 requires a dictionary with 1,000 entries, one per distinct string, with no repetition to exploit. The resulting OSON document is large enough to exceed Oracle's block size and gets stored out of row as a LOB. A nested document whose entire structure uses only 20 distinct names — field0 through field9 and sub0 through sub9 — has a dictionary with just 20 short entries. Furthermore, objects that share the same field structure can reuse each other's field ID array entirely rather than repeating it. The resulting document is compact enough to be stored inline, directly in the table block, with no LOB indirection at all.

The consequence is dramatic. Scanning the flat collection means fetching a pointer from the table block, then chasing it to a LOB block, for every single one of the 100,000 documents. Scanning the nested collection means reading the document directly from the table block. That is the difference between 301,074 consistent gets and 100,221.

So the two formats reach the same conclusion by different paths. BSON benefits from nesting because the length-prefix on sub-documents lets the parser skip entire branches with a single jump, reducing the number of field names it reads. OSON benefits from nesting because the dictionary compresses repeated field names, shrinking the documents enough to keep them inline in the table and avoiding the cost of LOB storage entirely. The mechanism differs, but the advice is the same. Documents are nested structures to aggregate multiple entities and object values.

Conclusion

The practical takeaway is straightforward: model your documents the way you would naturally think about the data — as a hierarchy of related objects — and the binary JSON format your database uses will reward you for it, whether that is BSON in MongoDB or OSON in Oracle. Flat documents with hundreds of top-level fields are not just harder to read. They are measurably slower to query and heavier on storage when compression is applied. Nesting is not only good document design.

May 29, 2026

$exists and non-sparse indexes in MongoDB and in other DocumentDB

In SQL databases, NULL represents an unknown value — not the absence of a value. When a value is simply non-applicable for a given entity, the correct relational modeling approach is normalization: the entity gets no row in the relevant table at all, rather than a NULL in a column. This distinction becomes tricky with OUTER JOIN results, where the absence of a row is surfaced as NULL across all columns of the unmatched side, including key columns — making it easy to confuse "unknown value" with "no row existed."

MongoDB has its own subtlety: a field can be explicitly set to null or simply not exist in the document at all. In the BSON representation, these are distinct — one is a key with a null-typed value, the other is the absence of the key entirely. The schema is flexible: you can define a field or not. But in indexes, this distinction disappears. Except for partial indexes, indexes must have a key value for every document it covers. For documents where the field is missing, MongoDB uses null as a stand-in — the same key value used for explicit nulls. This means an index scan cannot distinguish between the two states, and resolving null vs. missing requires fetching the full document to apply a residual filter.

Consequently, a standard index scan with a filter on null or $exists is inexact: the query planner performs an index scan on the null key and then fetches the full document to verify whether the field is truly null or simply absent.

An example: { $exists: true } filter

When you query with { num: { $exists: true } }, you expect MongoDB to use an index on num. Let's test it on MongoDB, as well as some emulations: Oracle Database, Amazon DocumentDB (AWS), and DocumentDB extension on PostgreSQL (Microsoft).

Here is my test collection:

db.test.insertMany([
  { _id: 1, num: 42   },
  { _id: 2, num: 7    },
  { _id: 3, num: null },
  { _id: 4            },
  { _id: 5, num: 99   },
  { _id: 6            },
  { _id: 7, num: null },
  { _id: 8, num: 15   }
])

I have inserted eight documents:

  • four with real values (_id 1, 2, 5, 8),
  • two with the field explicitly set to null (_id 3, 7), and
  • two where the field is entirely absent (_id 4, 6).

The query { num: { $exists: true } } should return six documents — everything except _id 4 and 6.

Before touching indexes, notice that $exists is not the same as a null check:

db.test.find({ num: null })

[
  { _id: 3, num: null },
  { _id: 4 },
  { _id: 6 },
  { _id: 7, num: null }
]

db.test.find({ num: { $exists: false } })

[
  { _id: 4 },
  { _id: 6 }
]

db.test.find({ num: { $exists: true } })

[
  { _id: 1, num: 42 },
  { _id: 2, num: 7 },
  { _id: 3, num: null },
  { _id: 5, num: 99 },
  { _id: 7, num: null },
  { _id: 8, num: 15 }
]

A field set to null exists. A field not written into the document does not. This distinction is perfectly clear at the document level. At the index level, it is not.

Null in the index key is ambiguous

When MongoDB builds a B-tree index on num, it must create an entry for every document. For documents with no num field, the index key exists with a null value. For documents where num is explicitly set to null, it also stores null. Both cases produce the same index key.

Here is how the non-sparse index looks:

Non-sparse index on { num: 1 }:

  null  →  _id:3  { num: null }    explicit null
  null  →  _id:4  { }              missing field
  null  →  _id:6  { }              missing field
  null  →  _id:7  { num: null }    explicit null
  7     →  _id:2
  15    →  _id:8
  42    →  _id:1
  99    →  _id:5

The four entries under the null key are indistinguishable from the index alone. To evaluate $exists, the engine must read the actual document. This is called a residual predicate — a filter condition the index cannot resolve, deferred to a later fetch stage.

Another way to look at it: the document schema is flexible, with no structure declared upfront and fields that may or may not exist, whereas indexes are different—their schema is declared, and the key fields always exist.

MongoDB with a non-sparse index

I create a regular index, which is by default non-sparse and has one index entry per document (or more for multi-key indexes).

db.test.createIndex({ num: 1 })

db.test.find({ num: { $exists: true } }).explain("executionStats")

The execution plan shows what happens across the IXSCAN and FETCH stages:

executionStats: {
  nReturned: 6,
  totalKeysExamined: 8,
  totalDocsExamined: 8,
  executionStages: {
    stage: 'FETCH',
    filter: { num: { '$exists': true } },
    nReturned: 6,
    docsExamined: 8,
    inputStage: {
      stage: 'IXSCAN',
      nReturned: 8,
      isSparse: false,
      indexBounds: { num: [ '[MinKey, MaxKey]' ] },
      keysExamined: 8
    }
  }
}

The IXSCAN returns all 8 index entries across the full [MinKey, MaxKey] range. The FETCH stage then reads all 8 documents and applies filter: { num: { $exists: true } } as a residual predicate, discarding _id 4 and 6. Notice docsExamined: 8 but nReturned: 6 — two fetches were wasted. The index was used, but the null bucket forced unnecessary work.

MongoDB with a sparse index

A sparse index excludes documents where the indexed field is entirely absent. It does not exclude explicit null values. Documents _id 3 and 7 have num: null and are still indexed.

db.test.createIndex({ num: 1 }, { sparse: true })

db.test.find({ num: { $exists: true } }).explain("executionStats")

As I have no projection, there is still a FETCH, but only for the documents in the final result:

executionStats: {
  nReturned: 6,
  totalKeysExamined: 6,
  totalDocsExamined: 6,
  executionStages: {
    stage: 'FETCH',
    nReturned: 6,
    docsExamined: 6,
    inputStage: {
      stage: 'IXSCAN',
      nReturned: 6,
      isSparse: true,
      indexBounds: { num: [ '[MinKey, MaxKey]' ] },
      keysExamined: 6
    }
  }
}

keysExamined dropped from 8 to 6 — the two missing-field documents are not in the index. More importantly, the FETCH stage has no filter. There is no residual predicate. Every document pointed to by the sparse index either has a real value or has an explicit null — both satisfy $exists: true. The index structure itself proves the condition. The FETCH still happens because find() needs to return the documents, but it is doing useful work only, not wasted disambiguation.

Here is how the sparse index looks:

Sparse index on { num: 1 }:

  null  →  _id:3  { num: null }    explicit null — indexed
  null  →  _id:7  { num: null }    explicit null — indexed
  7     →  _id:2
  15    →  _id:8
  42    →  _id:1
  99    →  _id:5

  _id:4  { }  — not indexed
  _id:6  { }  — not indexed

The null bucket still exists in a sparse index, but it contains only explicit nulls. The ambiguity is gone.

Oracle Database

I reproduced the same on Oracle Database with the MongoDB emulation:

ora> db.test.createIndex({ num: 1 })
num_1
ora> db.test.find({ num: { $exists: true } }).explain("executionStats")
{
  queryPlanner: {
    namespace: 'ora.test',
    parsedQuery: { num: { '$exists': true } },
    rewrittenQuery: { num: { '$exists': true } },
    generatedSql: `select "DATA",rawtohex("RESID"),"ETAG" from "ORA"."test" where JSON_EXISTS("DATA",'$?(exists(@.num)) ' type(strict))`,
    winningPlan: ' Plan Hash Value  : 3552627291 \n' +
      '\n' +
      '--------------------------------------------------------------------------------------------------\n' +
      '| Id  | Operation                             | Name            | Rows | Bytes | Cost | Time     |\n' +
      '--------------------------------------------------------------------------------------------------\n' +
      '|   0 | SELECT STATEMENT                      |                 |    1 | 24501 |    2 | 00:00:01 |\n' +
      '|   1 |   TABLE ACCESS BY INDEX ROWID BATCHED | test            |    1 | 24501 |    2 | 00:00:01 |\n' +
      '|   2 |    HASH UNIQUE                        |                 |    1 | 24501 |      |          |\n' +
      '| * 3 |     INDEX RANGE SCAN (MULTI VALUE)    | $ora:test.num_1 |    1 |       |    1 | 00:00:01 |\n' +
      '--------------------------------------------------------------------------------------------------\n' +
      '\n' +
      'Predicate Information (identified by operation id):\n' +
      '------------------------------------------\n' +
      `* 3 - access(JSON_QUERY("DATA" /*+ LOB_BY_VALUE */ FORMAT OSON , '$."num"[*]' RETURNING ANY ORA_RAWCOMPARE ASIS WITHOUT ARRAY WRAPPER ERROR ON ERROR PRESENT ON EMPTY NULL ON MISMATCH TYPE(LAX)\n` +
      "  MULTIVALUE)>HEXTORAW('01'))\n" +
      '\n' +
      '\n' +
      'Notes\n' +
      '-----\n' +
      '- Dynamic sampling used for this statement ( level = 2 )\n' +
      '\n'
  },
  serverInfo: { host: 'localhost', port: 27017, version: '7.0.22' },
  ok: 1
}
ora>

It doesn't display the execution statistics, but I can get it from the SQL endpoint:

sql> select /*+ gather_plan_statistics */ "DATA",rawtohex("RESID"),"ETAG" from "ORA"."test" where JSON_EXISTS("DATA",'$?(exists(@.num)) ' type(strict));

DATA                    RAWTOHEX("RESID")    ETAG
_______________________ ____________________ ___________________________________
{"_id":3,"num":null}    03C104               523160F3D2777CB2E0637B5B000A71CD
{"_id":7,"num":null}    03C108               523160F3D27F7CB2E0637B5B000A71CD
{"_id":2,"num":7}       03C103               523160F3D2757CB2E0637B5B000A71CD
{"_id":8,"num":15}      03C109               523160F3D2817CB2E0637B5B000A71CD
{"_id":1,"num":42}      03C102               523160F3D2737CB2E0637B5B000A71CD
{"_id":5,"num":99}      03C106               523160F3D27B7CB2E0637B5B000A71CD

6 rows selected.

sql> select * from dbms_xplan.display_cursor(format=>'allstats last');

PLAN_TABLE_OUTPUT
____________________________________________________________________________________________________________________
SQL_ID  c08vsvqpn75vw, child number 0
-------------------------------------
select /*+ gather_plan_statistics */ "DATA",rawtohex("RESID"),"ETAG"
from "ORA"."test" where JSON_EXISTS("DATA",'$?(exists(@.num)) '
type(strict))

Plan hash value: 3552627291

-----------------------------------------------------------------------------------------------------------------
| Id  | Operation                           | Name            | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
-----------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                    |                 |      1 |        |      6 |00:00:00.01 |       2 |
|   1 |  TABLE ACCESS BY INDEX ROWID BATCHED| test            |      1 |      1 |      6 |00:00:00.01 |       2 |
|   2 |   HASH UNIQUE                       |                 |      1 |      1 |      6 |00:00:00.01 |       1 |
|*  3 |    INDEX RANGE SCAN (MULTI VALUE)   | $ora:test.num_1 |      1 |      1 |      6 |00:00:00.01 |       1 |
-----------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   3 - access("test"."SYS_NC00005$">HEXTORAW('01'))

Note
-----
   - dynamic statistics used: dynamic sampling (level=2)


26 rows selected.

The index range scan returned 6 entries, as if it were a sparse index. We cannot create a sparse index on Oracle Database:

ora> db.test.dropIndex({ num: 1 })
{ nIndexesWas: 2, ok: 1 }

ora> db.test.createIndex({ num: 1 } , { sparse: 1 })
MongoServerError[MONGO-67]: Unsupported index option: sparse

Amazon DocumentDB (AWS)

AWS DocumentDB speaks the MongoDB wire protocol but is built on a completely different architecture. The storage layer is distributed like Aurora, replicated across three availability zones. The query planner and storage engine are specific to Amazon DocumentDB and deliver performance characteristics that differ from both MongoDB and standard PostgreSQL.

Non-sparse index on Amazon DocumentDB

The { num: { $exists: true } } query does not use the non-sparse index (created as db.test.createIndex({ num: 1 })) on Amazon DocumentDB (tested on version 8, planner version 3):

queryPlanner: {
  plannerVersion: 3,
  winningPlan: { stage: 'COLLSCAN', filter: { num: { '$exists': true } } }
},
executionStats: {
  nReturned: '6',
  executionTimeMillis: '14.121',
  planningTimeMillis: '14.019',
  executionStages: {
    stage: 'COLLSCAN',
    nReturned: '6',
    executionTimeMillisEstimate: '0.025'
  }
}

The index is completely abandoned. The planner chose a full collection scan.

Sparse index on Amazon DocumentDB

With the index created as db.test.createIndex({ num: 1 }, { sparse: true }), the index is used:

queryPlanner: {
  plannerVersion: 3,
  winningPlan: { stage: 'IXSCAN', indexName: 'num_1', direction: 'forward' }
},
executionStats: {
  nReturned: '6',
  executionTimeMillis: '10.034',
  planningTimeMillis: '8.128',
  executionStages: {
    stage: 'IXSCAN',
    nReturned: '6',
    executionTimeMillisEstimate: '1.842',
    indexName: 'num_1',
    direction: 'forward'
  }
}

Every entry in the sparse index provably satisfies $exists: true. It scans 6 index entries and returns 6 documents. While a sparse index is optional in MongoDB, it is mandatory in Amazon DocumentDB to use an index for this query at all.

Microsoft DocumentDB on PostgreSQL

Microsoft DocumentDB is implemented as an open-source PostgreSQL extension, accessed via the MongoDB wire protocol through a compatible endpoint.

With DocumentDB on PostgreSQL, a sparse index is not required for an optimal access path. I created the index as db.test.createIndex({ num: 1 }) and used a hint to force the index, since on a small collection the cost-based planner would otherwise prefer a sequential scan:

db.test.find(
  { num: { $exists: true } }
).hint("num_1").explain("executionStats")

(Be careful when using a hint with MongoDB queries, as it may change the result, limiting the scan to what is indexed)

The execution plan reads only the necessary entries from the index:

executionStats: {
  nReturned: Long('6'),
  executionTimeMillis: 0.093,
  executionStartAtTimeMillis: 0.089,
  totalDocsExamined: Long('6'),
  totalKeysExamined: Long('6'),
  executionStages: {
    stage: 'FETCH',
    nReturned: Long('6'),
    executionTimeMillis: 0.093,
    executionStartAtTimeMillis: 0.089,
    totalKeysExamined: 6,
    numBlocksFromCache: 24,
    inputStage: {
      stage: 'IXSCAN',
      nReturned: Long('6'),
      executionTimeMillis: 0.093,
      executionStartAtTimeMillis: 0.089,
      indexName: 'num_1',
      totalKeysExamined: 6,
      numBlocksFromCache: 24
    }
  }
}

It shows the same count for index entries (totalKeysExamined: 6) and documents fetched (totalDocsExamined: 6).

Here we can go further: we can bypass the MongoDB layer entirely and query PostgreSQL directly, seeing exactly what the database engine sees.

To understand why, we can look at the underlying implementation from the PostgreSQL catalog:

\d documentdb_data.documents_15

           Table "documentdb_data.documents_15"
     Column      |  Type  | Collation | Nullable | Default
-----------------+--------+-----------+----------+---------
 shard_key_value | bigint |           | not null |
 object_id       | bson   |           | not null |
 document        | bson   |           | not null |
Indexes:
    "collection_pk_15" PRIMARY KEY, btree (shard_key_value, object_id)
    "documents_rum_index_47" documentdb_extended_rum
        (document bson_extended_rum_composite_path_ops
         (pathspec='[ "num" ]', tl='2691'))

There are no individual columns for num, name, or any other document field. The entire document is stored as a single bson blob in the document column. PostgreSQL has no native knowledge of what is inside it. The collection name test maps to documents_15, where 15 is the collection's internal identifier.

The index is not a standard PostgreSQL B-tree. It is an Extended RUM index — documentdb_extended_rum — with a custom operator class: bson_extended_rum_composite_path_ops. RUM is an extension of GIN (Generalized Inverted Index) that adds support for ordering, range scans, and additional per-entry metadata. The operator class is the critical piece: it knows how to extract the num field from the opaque BSON blob and store it in a structure PostgreSQL can search. pathspec='[ "num" ]' tells it which field to index.

We can obtain the PostgreSQL execution plan directly using the DocumentDB API. I disabled sequential scans to override the cost-based planner's preference on this small table:

postgres=# set enable_seqscan to off;
           explain (analyze, buffers, verbose, costs off)
select document from bson_aggregation_find(
  'test',
  '{
    "find": "test",
    "filter": { "num": { "$exists": true } }
  }'::documentdb_core.bson
);
                                                 QUERY PLAN
-------------------------------------------------------------------------------------------------------------
 Index Scan using num_1 on documentdb_data.documents_15 collection (actual time=0.036..0.040 rows=6 loops=1)
   Output: document
   Index Cond: (collection.document @>= '{ "num" : { "$minKey" : 1 } }'::bson)
   Buffers: shared hit=3
 Planning:
   Buffers: shared hit=26
 Planning Time: 0.416 ms
 Execution Time: 0.052 ms
(8 rows)

The $exists: true predicate has been translated into a PostgreSQL index condition: document @>= '{ "num": { "$minKey": 1 } }'::bson. This uses a custom BSON operator @>= meaning "document has field num with a value greater than or equal to MinKey."

MinKey is a special BSON sentinel value that sits below every other BSON value in the type ordering. The condition @>= MinKey therefore means "field num exists and has any BSON value at all" — which is exactly $exists: true. Existence becomes a range scan from the minimum possible value: an elegant encoding.

The RUM index is path-based and only creates entries for paths that actually exist in documents. However, documents where num is absent also have their index entry, that are scanned by the opposite filter { num: {$exists": false} }:

postgres=# explain (analyze, buffers, verbose, costs off)
select document from bson_aggregation_find(
  'test',
  '{
    "find": "test",
    "filter": { "num": { "$exists": false } }
  }'::documentdb_core.bson
);
                                                 QUERY PLAN
-------------------------------------------------------------------------------------------------------------
 Index Scan using num_1 on documentdb_data.documents_15 collection (actual time=0.030..0.033 rows=2 loops=1)
   Output: document
   Index Cond: (collection.document @? '{ "num" : false }'::bson)
   Buffers: shared hit=3
 Planning:
   Buffers: shared hit=80
 Planning Time: 0.282 ms
 Execution Time: 0.070 ms
(8 rows)

This has read the two rows (rows=2) without the num field.

The complete picture

Here is the summary for { $exists: true } queries on the example above:

Non-sparse index Sparse index
MongoDB FETCH ← IXSCAN, 8 keys, 8 docs, residual filter, 2 wasted fetches FETCH ← IXSCAN, 6 keys, 6 docs, no residual filter
Amazon DocumentDB (AWS) COLLSCAN, no index, 8 docs IXSCAN, 6 keys
DocumentDB on PostgreSQL (Microsoft) FETCH ← IXSCAN, 6 keys, 6 docs, no residua... (truncated)

May 28, 2026

Percona Operator for PostgreSQL 3.0.0: Hard Fork, OLM Scoping, Major Upgrades

The Percona Operator for PostgreSQL 3.0.0 is here. This is the release that completes the hard fork of the operator from the Crunchy Data PostgreSQL Operator into a fully independent project, with a dedicated upstream.pgv2.percona.com API group for the inherited CRDs, an automatic CRD-rename rollout for existing 2.x installs on upgrade, and a public roadmap … Continued

The post Percona Operator for PostgreSQL 3.0.0: Hard Fork, OLM Scoping, Major Upgrades appeared first on Percona.

Guide your Amazon Aurora MySQL migration with Kiro powers

Today, we announce the Amazon Aurora MySQL power for Kiro. The power connects Kiro’s AI agent to Aurora MySQL and pairs live database access with curated best-practice guidance. You describe what you need in natural language. The agent generates the API calls, SQL, and configuration for you to review and run. In this post, we walk through how the power guides a production migration from Amazon Relational Database Service (Amazon RDS) for MySQL 8.0 to Aurora MySQL through four phases: assessment, replica creation, promotion, and post-cutover validation.

May 27, 2026

Optimize costs in Amazon Aurora

By implementing modern optimization techniques for Aurora, you can achieve additional cost reduction beyond traditional methods alone. This isn’t only about spending less—it’s about building a more efficient, scalable, and resilient database environment. In this post, we show you a structured approach to optimizing Amazon Aurora database costs. It outlines specific strategies, implementation steps, and best practices across different optimization areas.

Migrate from Crunchy Data PostgreSQL Operator to Percona PostgreSQL Operator: Backup-Restore and PV Reuse

A Percona PostgreSQL operator pgBackRest restore is the simplest way to move off the Crunchy Data PostgreSQL Operator: take a full Crunchy backup, point the new Percona cluster’s dataSource at the existing pgBackRest archive, and the cluster bootstraps from it before its first start. This post covers that path, plus a second option, persistent-volume reuse, for cases … Continued

The post Migrate from Crunchy Data PostgreSQL Operator to Percona PostgreSQL Operator: Backup-Restore and PV Reuse appeared first on Percona.

CedarDB: Features of April 2026

This post takes a closer look at some of the most impactful features we have shipped in CedarDB across our recent releases. Whether you have been following along closely or are just catching up, here is a deeper look at the additions we are most excited about.

Set-Returning Functions: Lock-Step Evaluation

v2026-04-20

When handling bulk data transformations or speeding up database inserts, a popular developer trick is to use multiple set-returning functions side-by-side in the SELECT clause to “zip” arrays together into individual rows. To improve performance and scalability, PostgreSQL changed its behavior regarding this in version 10, shifting from generating a massive cross-product to a much more efficient lock-step evaluation (see the PostgreSQL 10 Release Notes).

To guarantee seamless compatibility and keep your queries lightning-fast, CedarDB evaluates multiple set-returning functions in the SELECT list in the exact same lock-step manner.

-- Zipping up arrays in lock-step to quickly generate rows
SELECT UNNEST('{alice, bob, charlie}'::TEXT[]) AS user_name, 
 UNNEST('{active, inactive, active}'::TEXT[]) AS status, 
 UNNEST('{150, 200, 350}'::INT[]) AS score;

Instead of exploding into 27 rows of useless, cross-joined data, CedarDB cleanly steps through the arrays row-by-row to return exactly 3 perfectly paired rows. If you rely on array unnesting to batch your application’s database inserts, you can now enjoy highly scalable performance and behavior that is completely identical to modern PostgreSQL.

Note: This lock-step evaluation applies to other set-returning functions you might already know! Alongside UNNEST, you can use functions like generate_series, json_array_elements, or regexp_matches to efficiently generate and zip your data.

ON UPDATE CASCADE: Keep Your Data in Sync Automatically

v2026-04-20

Changing core identifiers, like a user’s handle or a department code, used to mean manually updating every referencing row to avoid breaking foreign key constraints. To make your life easier, CedarDB now supports ON UPDATE CASCADE. Just add this clause to your foreign key, and CedarDB will automatically propagate updates from the parent table directly to its child tables.

Say you have a platform where posts reference an author’s username. If an author changes their handle, a single UPDATE handles the rest:

CREATE TABLE authors (username TEXT PRIMARY KEY);

CREATE TABLE posts (
 post_id INTEGER PRIMARY KEY,
 created_at TIMESTAMP,
 author_username TEXT REFERENCES authors(username) ON UPDATE CASCADE
);

-- Updating 'alice' to 'alice_smith' automatically updates all her posts!
UPDATE authors SET username = 'alice_smith' WHERE username = 'alice';

Note: To guarantee predictable performance and prevent runaway loops, CedarDB currently limits this to single-level cascades. An auto-updated column cannot act as the trigger for another cascade into a third table. CedarDB validates this at table creation time, so your schema stays consistent and performant.

pg_stat_database and pg_stat_activity: Observability Out of the Box

v2026-04-20

CedarDB now implements pg_stat_database and pg_stat_activity, two of Postgres’ most widely used monitoring tables. This means your existing observability stack (pgAdmin, Datadog, or any custom dashboard that speaks Postgres) just works with CedarDB, no changes required.

pg_stat_activity gives you a live window into what your database is doing right now: active queries, connection states, and client details. Spot long-running idle transactions that are holding locks or causing WAL bloat:

SELECT pid, usename, state, xact_start, now() - xact_start AS idle_duration
FROM pg_stat_activity
WHERE state = 'idle in transaction'
 AND now() - xact_start > interval '5 minutes';

pg_stat_database complements this with per-database aggregate statistics: transactions committed and rolled back, cache hit rates, tuples returned, and more. To check your database health at a glance:

SELECT datname,
 blks_hit::float / nullif(blks_hit + blks_read, 0) AS cache_hit_ratio,
 xact_commit,
 xact_rollback
FROM pg_stat_database
WHERE datname = current_database();

VACUUM (TRUNCATE): Release Disk Space Back to the OS

v2026-04-20

CedarDB’s storage footprint grows as your data grows, but until now, the main storage file never shrank. Dropped indexes, truncated tables, and deleted data all freed up pages internally, but the underlying file stayed the same size on disk. In some cases this could leave you with a much larger file than your actual data warrants, for example after building and then dropping a large index, or after rewriting ALTER operations.

VACUUM (TRUNCATE) addresses the most straightforward case: if there are unused pages at the end of the storage file, CedarDB will truncate the file and return that space to the OS.

-- After dropping a large index or table, reclaim the trailing space
VACUUM (TRUNCATE);

CedarDB also now properly returns pages to the free pool after ALTER TABLE and ALTER INDEX statements, making them eligible for truncation. More comprehensive shrinking behavior, covering space freed in the middle of the file, will follow in future releases.

Note: Only trailing unused pages can be released to the OS today. Freed space in the middle of the file is currently retained for reuse by future writes.

json_agg and json_build_array: JSON Aggregation in SQL

v2026-04-27

Two commonly used JSON aggregation functions are now available in CedarDB: json_agg and json_build_array.

json_agg aggregates rows into a JSON array, which makes it straightforward to produce nested JSON results directly from a query. This is useful for building API responses or feeding data to applications that expect JSON without an extra serialization step:

-- Return each author with a JSON array of their post titles
SELECT a.username,
 json_agg(p.title ORDER BY p.created_at DESC) AS recent_posts
FROM authors a
JOIN posts p ON p.author_username = a.username
GROUP BY a.username;

json_build_array lets you construct a JSON array from explicit values or column references in a single row:

SELECT json_build_array(user_id, username, email) AS user_tuple
FROM users
LIMIT 5;

Together, these two functions cover the most common patterns for producing JSON output directly in SQL, without needing to post-process results in application code.


That’s it for now


Questions or feedback? Join us on Slack or reach out directly.

Do you want to try CedarDB straight away? Sign up for our free Enterprise Trial below. No credit card required.

May 26, 2026

Announcing VillageSQL Server 0.0.4

Explore VillageSQL Server 0.0.4: now featuring VEF v3, custom aggregates, parameter inference, and preview capabilities like background threads.

The Autovacuum Scale Factor Problem at Scale - Know Your Defaults

In PostgreSQL, autovacuum and autoanalyze exist to clean up dead tuples (old versions of updated/deleted rows) and update query planner statistics, respectively. The challenge is running them frequently enough so that query plans and execution do not degrade after data modifications, but not so frequently as to cause excessive I/O overhead.

Databases often maintain a counter of the number of modifications to trigger these background jobs. Oracle Database and MySQL use a stale percentage (the ratio of modifications to total rows) for statistics gathering. SQL Server uses a dynamically decreasing percentage to ensure statistics do not remain stale for too long on massive tables. PostgreSQL uses a hybrid approach: a fixed base threshold combined with a scale factor (a percentage) that grows proportionally with the table size.

This hybrid approach hits the sweet spot for most workloads, but it often requires tuning based on your specific data. The key factor to watch is the amount of static, "cold" data in your tables. Because the scale factor is calculated against the total table size, a large volume of cold data will significantly inflate the threshold. This can delay maintenance on the active working set—the "hot" data actually used by your queries—leaving it vulnerable to stale statistics or bloat.

Here are the default base thresholds:

postgres=# \dconfig *autovacuum*threshold
        List of configuration parameters
             Parameter              |   Value
------------------------------------+-----------
 autovacuum_analyze_threshold       | 50
 autovacuum_vacuum_insert_threshold | 1000
 autovacuum_vacuum_max_threshold    | 100000000
 autovacuum_vacuum_threshold        | 50
(4 rows)

At first glance, this suggests tables are analyzed when 50 rows are modified, and vacuumed when 50 dead tuples accumulate (from deletes or updates) or 1,000 rows are inserted. But this is only true without the scale factor—10% for statistics, 20% for vacuum:

postgres=# \dconfig *autovacuum*scale_factor
       List of configuration parameters
               Parameter               | Value
---------------------------------------+-------
 autovacuum_analyze_scale_factor       | 0.1
 autovacuum_vacuum_insert_scale_factor | 0.2
 autovacuum_vacuum_scale_factor        | 0.2
(3 rows)

Because of the scale factor, the actual trigger thresholds increase with the size of the table. For the default settings, the formulas are:

  • Analyze when inserts or modifications > 10% table row count plus 50 rows
  • Vacuum when dead tuples (from DELETE or UPDATE) > 20% table row count plus 50 rows
  • Vacuum when inserts > 20% table row count plus 1000 rows

As these formulas show, a larger table requires a much larger accumulation of changes before maintenance fires. This is perfectly acceptable if data churn is uniformly distributed, as small changes across a massive dataset will not drastically impact query cost estimations.

However, data distribution is rarely uniform and evolves over time (e.g., seasonal sales spikes, market expanding to new countries). Because static data inflates the table row count in the formulas above, your database waits too long to trigger maintenance on the active working set.

This is the core problem with default autovacuum settings at scale: a table with 5 million rows can accumulate half a million stale modifications before the planner statistics are refreshed, and over a million dead tuples before bloat is cleaned up. The larger the table grows, the longer it waits, and the worse the situation becomes:

  • Query planner statistics become increasingly stale between analyzes.
  • The visibility map is stale for longer and index-only scans become less efficient.
  • Dead tuple bloat accumulates more between cleanups, wasting storage and slowing scans.

To demonstrate this, I have run the following script to simulate this kind of activity, constantly inserting 100 rows and then updating them. We delete nothing because we want to keep the history, but queries operate on those recent rows. Think of it like orders being entered, then processed, and remaining stored:

\c
\o tmp.log

-- run autovacuum frequently for the demo
alter system set autovacuum_naptime = '1s';
select pg_reload_conf();

-- create a table
drop table demo;
create table demo (
 id bigserial primary key, n int default 0
);
vacuum analyze demo
;

-- show the vacuum and analyze statistics,
-- insert 100 rows and update them
-- run that in a loop every 5 seconds
select relname, n_tup_ins, n_tup_upd , n_mod_since_analyze, n_ins_since_vacuum
 , autovacuum_count,  last_autovacuum  --, vacuum_count,  last_vacuum
 , autoanalyze_count, last_autoanalyze --, analyze_count, last_analyze
from pg_stat_user_tables where relid='demo'::regclass
\;
insert into demo select from generate_series(1,100)
\;
update demo set n=n+1 where id in (
select id from demo order by id desc limit 100
)
\watch i=5 c=100000

For each iteration, the total number of rows inserted (n_tup_ins) and updated (n_tup_upd), as visible in pg_stat_user_tables, increases by 100. It is the X-axis on this diagram (n_tup):

The Y-axis shows the staleness of statistics (n_mod_since_analyze) and the accumulation of dead tuples (n_mod_since_vacuum) until the auto vacuum/analyze kicks in.

With 5 million rows, the last million inserted rows accumulated dead tuples. That is 20% of the total table, as defined by the default vacuum scale factor, but it likely represents 100% of the data actively read by your queries (for example, if the application processes the last year or less of a 5-year history). Furthermore, the last 500,000 rows have completely stale statistics, the 10% default analyze scale factor, and the past months may not have the same data distribution as the previous years.

Think about the impact this has on the maximum value for an ID sequence or a created_at timestamp. It also completely skews the query planner's understanding of your data distribution (such as querying by country or day of the week). I have seen this cause severe performance issues in the real world: a retail company where shops only open on Sundays during the summer, or a trading platform suddenly processing entirely new market trends. Because the statistics are stale, the planner assumes your new, active data looks exactly like your old, historical data.

As the table grows, the impact of this bloat and staleness compounds, and performance will no longer scale. Eventually, your execution plans will flip—not because the queries changed, but simply because the estimations of the query planner are completely wrong.

For very large tables where the total size increases but the active working set is a small, predictable number of rows, you can effectively disable the scale factor and rely almost entirely on the fixed threshold:

ALTER TABLE demo SET (  
    autovacuum_analyze_scale_factor = 0.001,  
    autovacuum_analyze_threshold    = 10000,  
    autovacuum_vacuum_scale_factor  = 0.001,  
    autovacuum_vacuum_threshold     = 10000  
);  

This sets a nearly flat threshold that does not grow with the table size. The right threshold value depends on how many rows your active working set changes per hour and how much staleness you can tolerate. However, you must monitor the consequences of running autovacuum frequently on a growing table to ensure it does not cause localized I/O spikes.

Here is how the same run starts with the new table settings:

Auto analyze never left more than ~10,000 modified rows without refreshing statistics. This threshold grows slightly with the table (at 10 million rows it doubles to 20,000), but remains vastly better than the default. Auto vacuum follows the same pattern for dead tuples, but runs more frequently here because the insert-specific vacuum trigger was left at its default (1,000 rows + 20% scale factor), which only triggers the analyze threshold beyond 45,000 rows.

To address this unbounded growth natively, recent PostgreSQL versions introduced autovacuum_vacuum_max_threshold (with a default of 100 million—which is too high for my example). Rather than letting the scale factor dictate an endlessly growing target, this parameter imposes a hard ceiling on the vacuum threshold calculation. This means that even on a massive 1-billion-row table, autovacuum will forcibly trigger once dead tuples reach the configured maximum, serving as a built-in safety net. You can even adjust this globally via a simple config reload, or set it as a per-table storage parameter to enforce a strict upper limit on staleness without micromanaging scale factors across your entire schema.

Naturally, enforcing stricter thresholds—whether through these new maximum caps or manual table-level tuning—means autovacuum will run more frequently, which demands more background worker capacity. Historically, increasing autovacuum_max_workers to handle this extra load required a full database restart. However, PostgreSQL now thoughtfully splits this architecture: autovacuum_worker_slots reserves the hard upper bound of backend slots at postmaster startup, while autovacuum_max_workers dictates how many of those slots are actively used. This allows you to dynamically scale up your active workers on the fly (ALTER SYSTEM SET autovacuum_max_workers = 8; SELECT pg_reload_conf();) to absorb heavy maintenance workloads without incurring any downtime.

Alternatively, if your table has a clear time-based or categorical boundary between hot and cold data, partitioning is worth considering. Autovacuum operates per partition, so a current_year partition with 100,000 rows will trigger maintenance far sooner than a monolithic 5-million-row table, meaning the default scale factor will naturally behave exactly as intended.

May 25, 2026

Running TidesDB as a MySQL 9.7 storage engine

tidesdb-mysql is an experimental build that was developed to verify how TidesDB, the LSM-tree key/value engine, can work with MySQL 9.7 as a storage engine. The current build is v0.2.4, and it’s an experiment, not a finished product. So you can use it in your tests if you also want to try TidesDB with MySQL … Continued

The post Running TidesDB as a MySQL 9.7 storage engine appeared first on Percona.

Migrate from Crunchy Data PostgreSQL Operator to Percona PostgreSQL Operator: Standby Cluster Method

A Crunchy to Percona PostgreSQL migration is more straightforward than most cross-operator moves on Kubernetes, because the Percona PostgreSQL Operator is a hard fork of the Crunchy Data PostgreSQL Operator. Same Patroni HA, same pgBackRest backups, same overall CRD shape. This post walks through the safest of the three migration paths: a standby cluster method … Continued

The post Migrate from Crunchy Data PostgreSQL Operator to Percona PostgreSQL Operator: Standby Cluster Method appeared first on Percona.

May 23, 2026