a curated list of database news from authoritative sources

January 15, 2026

Percona Operator for PostgreSQL 2025 Wrap Up and What We Are Focusing on Next

In 2025, the Percona Operator for PostgreSQL put most of its energy into the things that matter when PostgreSQL is running inside real Kubernetes clusters: predictable upgrades, safer backup and restore, clearer observability, and fewer surprises from image and HA version drift. Backups and restores got more resilient and more controllable. In March, Operator 2.6.0 […]

Announcing Percona ClusterSync for MongoDB: The Open Source Trail to Freedom

Migrating mission-critical databases is often compared to changing the engines on a plane while it’s in mid-flight. For MongoDB users, this challenge has been historically steep, often involving complex workarounds or proprietary tools that keep you locked into a specific ecosystem. Today, we are thrilled to announce the General Availability of Percona ClusterSync for MongoDB […]

January 14, 2026

Debugging regressions with Postgres in IO-bound sysbench

I explained in this post that there is a possible performance regression for Postgres with IO-bound sysbench. It arrived in Postgres 16 and remains in Postgres 18. I normally run sysbench with a cached database, but I had a spare server so I repeated tests with an IO-bound workload.

The bad news for me is that I need to spend more time explaining the problem. The good news for me is that I learn more about Postgres by doing this. And to be clear, I have yet to explain the regression but this post documents my debugging efforts.

sysbench

This post explains how I use sysbench. Note that modern sysbench is a framework for running benchmarks and it comes with many built-in tests. By framework I mean that it includes a lot of plumbing for several DBMSs, and you can write tests in Lua with support for LuaJIT so that the client side of the tests uses less CPU.

The sysbench framework has been widely used in the MySQL community for a long time. Originally it hard-coded one test (or perhaps a few), and in too many cases by sysbench people mean the classic (original) sysbench transaction, which is a mix of point queries, range queries and writes. The classic transaction is now implemented by oltp_read_write.lua with help from oltp_common.lua.

The oltp_read_write.lua test is usually not good at detecting regressions, but in this case (the problem that motivated me to write this blog post) it did detect one.

How to find regressions

I have opinions about how to run DBMS benchmarks. For example, be careful about running read-only benchmarks for an LSM because the state (shape) of the LSM tree can have a large impact on CPU overhead, and the tree might stay in the same (good or bad) state for the duration of the test. With read-write tests the LSM state should cycle between better and worse states.

But traditional b-tree storage engines also have states. One aspect of that is MVCC debt -- write-heavy tests increase the debt and the engine doesn't always do a great job of limiting that debt. So I run sysbench tests in a certain order to both create and manage MVCC debt while trying to minimize noise.

The order of the tests is listed here and the general sequence is:

  1. Load and index the tables
  2. Run a few read-only tests to let MVCC debt get reduced if it exists
  3. Optionally run more read-only tests (I usually skip these)
  4. Run write-heavy tests
  5. Do things to reduce MVCC debt (see here for Postgres and MySQL)
  6. Run read-only tests (after the tables+indexes have been subject to random writes)
  7. Run delete-only and then insert-only tests

Results

The possible regressions are:
  • update-zipf in Postgres 16, 17 and 18
  • write-only in Postgres 16, 17 and 18
  • read-write in Postgres 16, 17 and 18
  • insert in Postgres 18 - this reproduces in 18.0 and 18.1

Legend:
* relative QPS for Postgres 16.11, 17.7 and 18.1
* relative QPS is (QPS for my version / QPS for Postgres 15.15)

16.11   17.7    18.1
1.01    1.00    1.01    update-inlist
1.06    1.03    1.04    update-index
1.04    1.04    1.04    update-nonindex
1.00    1.08    1.07    update-one
0.85    0.72    0.71    update-zipf
0.88    0.86    0.84    write-only_range=10000
0.71    0.82    0.81    read-write_range=100
0.74    0.78    0.82    read-write_range=10
1.06    1.03    1.00    delete
1.02    1.00    0.80    insert

Note that when the insert test has a relative QPS of 0.80 for Postgres 18.1, then 18.1 gets 80% of the throughput of Postgres 15.15. So this is a problem to figure out.

Explaining the insert test for Postgres 18.1

I wrote useful but messy Bash scripts to make it easier to run and explain sysbench results. One of the things I do is collect results from vmstat and iostat per test and then summarize average and normalized values from them, where a normalized value is: (avg from iostat or vmstat / QPS).

I then compute relative values from those, where the base case here is Postgres 15.15: (value from my version / value from the base case)
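
The scripts are Bash and not shown here; as a rough sketch of the same arithmetic in Python (with made-up per-test averages, not output from the actual scripts), the normalization and relative comparison look like this:

# Sketch of the normalization described above, with made-up per-test averages.
# normalized value = (average from vmstat/iostat during the test) / QPS
# relative value   = (normalized value for my version) / (normalized value for the base case)
def normalize(averages, qps):
    return {metric: value / qps for metric, value in averages.items()}

base = normalize({"cs": 171_000, "rKB": 305_000, "wKB": 827_000}, qps=34_164)  # base case: Postgres 15.15
mine = normalize({"cs": 143_000, "rKB": 305_000, "wKB": 852_000}, qps=27_304)  # my version: Postgres 18.1

relative = {metric: mine[metric] / base[metric] for metric in base}
print(relative)  # values above 1.0 mean more overhead per operation than the base case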

From the results below I see that from Postgres 17.7 to 18.1
  • throughput decreases by ~20% in 18.1
  • cpu/o increased by ~17% in 18.1
  • cs/o increased by ~2.6% in 18.1
  • r/o increased by ~14% in 18.1
  • rKB/o increased by ~22% in 18.1
  • wKB/o increased by ~27% in 18.1
So Postgres 18.1 does a lot more IO during the insert test. Possible reasons are that either insert processing changed to be less efficient or there is more MVCC debt at the start of the insert test from the write-heavy tests that precede it and thus there is more IO from write-back and vacuum during the insert test which reduces throughput. At this point I don't know. A next step for me is to repeat the benchmark with the delete test removed as that immediately precedes the insert test (see here).

Legend:
* cpu/o - CPU per operation. This includes us and sy from vmstat but
          it isn't CPU microseconds.
* cs/o - context switches per operation
* r/o - reads from storage per operation
* rKB/o - KB read from storage per operation
* wKB/o - KB written to storage per operation
* o/s - operations/s, QPS, throughput
* dbms - Postgres version

--- absolute
cpu/o           cs/o    r/o     rKB/o   wKB/o   o/s     dbms
0.000780        5.017   0.407   8.937   24.2    34164   15.15
0.000755        5.090   0.409   9.002   24.648  34833   16.11
0.000800        5.114   0.418   9.146   24.626  34195   17.7
0.000939        5.251   0.477   11.186  31.197  27304   18.1

--- relative to first result
0.97            1.01    1.00    1.01    1.02    1.02    16.11
1.03            1.02    1.03    1.02    1.02    1.00    17.7
1.20            1.05    1.17    1.25    1.29    0.80    18.1

I also have results from pg_stat_all_tables collected at the end of the delete and insert tests for Postgres 17.7 and 18.1. Then I computed the difference as (results after insert test - results after delete test). From the results below I don't see a problem. Note that the difference for n_dead_tup is zero. While my analysis was limited to one of the 8 tables used for the benchmark, the values for the other 7 tables are similar.
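
As a sketch of how such a diff can be collected (not the actual script used here; it assumes psycopg2 and one of the sysbench tables, sbtest1):

import psycopg2

COLS = ["n_tup_ins", "n_live_tup", "n_dead_tup",
        "n_mod_since_analyze", "n_ins_since_vacuum"]

def snapshot(conn, relname="sbtest1"):
    # Read a few counters for one table from pg_stat_all_tables
    with conn.cursor() as cur:
        cur.execute(
            f"SELECT {', '.join(COLS)} FROM pg_stat_all_tables WHERE relname = %s",
            (relname,))
        return dict(zip(COLS, cur.fetchone()))

conn = psycopg2.connect("dbname=sbtest")  # hypothetical connection string
before = snapshot(conn)                   # collected at the end of the delete test
# ... run the insert test here ...
after = snapshot(conn)                    # collected at the end of the insert test
diff = {c: after[c] - before[c] for c in COLS}
print({c: v for c, v in diff.items() if v != 0})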

The columns with a non-zero difference for Postgres 17.7 are:
  3996146   n_tup_ins
  3996146   n_live_tup
  3996146   n_mod_since_analyze
  3996146   n_ins_since_vacuum

The columns with a non-zero difference for Postgres 18.1 are:
  3191278   n_tup_ins
  3191278   n_live_tup
  3191278   n_mod_since_analyze
  3191278   n_ins_since_vacuum

Explaining the other tests for Postgres 16, 17 and 18

Above I listed the order in which the write-heavy tests are run. The regressions occur on the update-zipf, write-only and read-write tests. All of these follow the update-one test. 
  • There is no regression for write-only and read-write when I change the test order so these run prior to update-one and update-zipf
  • There is a regression if either or both of update-one and update-zipf are run prior to write-only and read-write
I assume that either the amount of MVCC debt created by update-one and update-zipf is larger starting with Postgres 16 or that starting in Postgres 16 something changed so that Postgres is less effective at dealing with that MVCC debt.

Note that the update-one and update-zipf tests are a bit awkward. For the update-one test all updates are limited to one row per table (the first row in the table). And for the update-zipf test a zipfian distribution is used to select the rows to update. So in both cases a small number of rows receive most of the updates.
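
As a toy illustration (not sysbench's actual random-number generator) of how a zipfian distribution concentrates updates on a few rows:

import random
from collections import Counter

# Weight row k by 1/k (a zipf-like skew) and sample one million "updates".
rows = 100_000
weights = [1.0 / k for k in range(1, rows + 1)]
picks = Counter(random.choices(range(1, rows + 1), weights=weights, k=1_000_000))

top = picks.most_common(rows // 100)            # the 1% most-updated rows
share = sum(count for _, count in top) / 1_000_000
print(f"top 1% of rows receive {share:.0%} of the updates")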

Results from pg_stat_all_tables collected immediately prior to update-one and then after update-zipf are here for Postgres 15.15 and 16.11. Then I computed the difference as (results after update-zipf - results prior to update-one).

The columns with a non-zero difference for Postgres 15.15 are:
  23747485   idx_scan
  23747485   idx_tup_fetch
  23747485   n_tup_upd
  22225868   n_tup_hot_upd
    -69576   n_live_tup
   3273433   n_dead_tup
  -1428177   n_mod_since_analyze

The columns with a non-zero difference for Postgres 16.11 are:
  23102012   idx_scan
  23102012   idx_tup_fetch
  23102012   n_tup_upd
  21698107   n_tup_hot_upd
   1403905   n_tup_newpage_upd
     -3568   n_live_tup
   2983730   n_dead_tup
  -2064095   n_mod_since_analyze

I also have the vmstat and iostat data for each of the tests. In all cases, for the tests that run after update-one, the amount of CPU and IO per operation increases in versions after Postgres 15.15.

For update-one results don't change much across versions

--- absolute
cpu/o           cs/o    r/o     rKB/o   wKB/o   o/s     dbms
0.000281        12.492  0.001   0.011   0.741   162046  15.15
0.000278        12.554  0.001   0.012   0.746   162415  16.11
0.000253        11.028  0.002   0.029   0.764   174715  17.7
0.000254        10.960  0.001   0.011   0.707   172790  18.1

--- relative to first result
0.99            1.00    1.00    1.09    1.01    1.00    16.11
0.90            0.88    2.00    2.64    1.03    1.08    17.7
0.90            0.88    1.00    1.00    0.95    1.07    18.1

For update-zipf the amount of CPU and IO per operation increases after Postgres 15.15

--- absolute
cpu/o           cs/o    r/o     rKB/o   wKB/o   o/s     dbms
0.000721        6.517   0.716   6.583   18.811  41531   15.15
0.000813        7.354   0.746   7.838   22.746  35497   16.11
0.000971        5.679   0.796   10.492  27.858  29700   17.7
0.000966        5.718   0.838   10.354  28.965  29289   18.1

--- relative to first result
1.13            1.13    1.04    1.19    1.21    0.85    16.11
1.35            0.87    1.11    1.59    1.48    0.72    17.7
1.34            0.88    1.17    1.57    1.54    0.71    18.1

For write-only_range=10000 the amount of CPU and IO per operation increases after Postgres 15.15

--- absolute
cpu/o           cs/o    r/o     rKB/o   wKB/o   o/s     dbms
0.000784        5.881   0.799   7.611   26.835  30174   15.15
0.000920        6.315   0.85    10.033  32      26444   16.11
0.001003        5.883   0.871   11.147  33.142  25889   17.7
0.000999        5.905   0.891   10.988  34.443  25232   18.1

--- relative to first result
1.17            1.07    1.06    1.32    1.19    0.88    16.11
1.28            1.00    1.09    1.46    1.24    0.86    17.7
1.27            1.00    1.12    1.44    1.28    0.84    18.1

For read-write_range=100 the amount of CPU and IO per operation increases after Postgres 15.15

--- absolute
cpu/o           cs/o    r/o     rKB/o   wKB/o   o/s     dbms
0.001358        7.756   1.758   25.654  22.615  31574   15.15
0.001649        8.111   1.86    30.99   35.413  22264   16.11
0.001629        7.609   1.856   29.901  28.681  25906   17.7
0.001646        7.484   1.978   29.456  29.138  25573   18.1

--- relative to first result
1.21            1.05    1.06    1.21    1.57    0.71    16.11
1.20            0.98    1.06    1.17    1.27    0.82    17.7
1.21            0.96    1.13    1.15    1.29    0.81    18.1

For read-write_range=10 the amount of CPU and IO per operation increases after Postgres 15.15

--- absolute
cpu/o           cs/o    r/o     rKB/o   wKB/o   o/s     dbms
0.000871        6.441   1.096   10.238  21.426  37141   15.15
0.001064        6.926   1.162   13.761  29.738  27540   16.11
0.001074        6.461   1.181   14.759  29.338  28839   17.7
0.001045        6.497   1.222   14.726  28.632  30431   18.1

--- relative to first result
1.22            1.08    1.06    1.34    1.39    0.74    16.11
1.23            1.00    1.08    1.44    1.37    0.78    17.7
1.20            1.01    1.11    1.44    1.34    0.82    18.1

The Importance of Realistic Benchmark Workloads

Unveiling the Limits: A Performance Analysis of MongoDB Sharded Clusters with plgm In any database environment, assumptions are the enemy of stability. Understanding the point at which a system transitions from efficient to saturated is essential for maintaining uptime and ensuring a consistent and reliable user experience. Identifying these limits requires more than estimation—it demands […]

Database Transactions

What are database transactions and how do SQL databases isolate one transaction from another?

January 13, 2026

BSON vs OSON: Different design goals

There was a discussion on LinkedIn comparing OSON (the binary JSON in Oracle Database) and BSON (the binary JSON in MongoDB). To be clear, BSON and OSON aren’t directly comparable because they serve different purposes:

  • BSON is a general-purpose binary serialization format designed for transport and storage efficiency, similar to Protobuf. It prioritizes fast encode/decode operations for network transmission and disk storage across diverse applications.
  • OSON is Oracle Database's internal JSON storage format, specifically engineered for database operations, with extensive metadata to enable efficient field querying and navigation without fully decoding the document.

They share two objectives but make different trade-offs:

  • Compactness Through Binary Encoding – Both formats achieve compactness without compression through binary encoding and type-specific representations. However, BSON prioritizes encoding speed with a straightforward binary representation, whereas OSON achieves greater compactness through local dictionary compression of field names within documents.
  • Partial Navigation Without Full Scanning – Both enable efficient partial document traversal, but with different approaches. BSON uses simple size prefixes to enable jumping between fields. A BSON document is meant to be stored as a single variable-size block (possible in the WiredTiger B-Tree). OSON implements comprehensive metadata structures—such as a tree segment and jumpable offsets—so that it can reduce random reads when stored in multiple fixed-size blocks (the Oracle Database storage model). A small sketch of BSON's skip-by-size scanning follows these lists.

They differ in one major design objective regarding updates:

  • BSON follows MongoDB's "store together what is accessed together" principle, optimizing for full-document read patterns and full-document writes to disk as a single contiguous block, rather than in-place modifications. Standard BSON documents are not designed for efficient partial updates outside of an in-memory mutable object representation.
  • OSON is designed to optimize database operations, including in-place modifications without additional memory allocation.
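
To make the navigation trade-off concrete, here is a minimal sketch in Python of skip-by-size scanning over BSON bytes. This is not the MongoDB driver's implementation and it handles only a few BSON types:

import struct
import bson  # pymongo's bson package, also used in the program below

def get_field_lazy(data: bytes, wanted: str):
    """Scan BSON bytes element by element and decode only the requested
    top-level field; every other value is skipped using its type and size."""
    pos, end = 4, len(data) - 1              # skip the int32 document length; last byte is 0x00
    while pos < end:
        elem_type = data[pos]
        pos += 1
        name_end = data.index(b"\x00", pos)  # the field name is a null-terminated string
        name = data[pos:name_end].decode()
        pos = name_end + 1
        if elem_type == 0x02:                # UTF-8 string: int32 length, bytes, 0x00
            length = 4 + struct.unpack_from("<i", data, pos)[0]
            if name == wanted:
                return data[pos + 4:pos + length - 1].decode()
        elif elem_type == 0x10:              # int32
            length = 4
            if name == wanted:
                return struct.unpack_from("<i", data, pos)[0]
        elif elem_type in (0x01, 0x12):      # double or int64
            length = 8
            if name == wanted:
                return struct.unpack_from("<d" if elem_type == 0x01 else "<q", data, pos)[0]
        elif elem_type in (0x03, 0x04):      # embedded document or array, size-prefixed
            length = struct.unpack_from("<i", data, pos)[0]
            if name == wanted:
                return bson.decode(data[pos:pos + length])
        else:
            raise NotImplementedError(f"BSON type 0x{elem_type:02x} not handled in this sketch")
        pos += length                        # jump over the value without decoding it

doc = bson.encode({"a": 1, "b": "hello", "c": {"nested": True}})
print(get_field_lazy(doc, "b"))   # hello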

I mentioned the Java driver in the previous post. The Oracle Java driver supports fast access via OracleJsonObject.get(), which avoids instantiating a new object and uses the internal metadata for navigation. I wanted to see how this works, but because the Oracle Database Java driver isn’t open source, I tried Python instead, where the Oracle driver is open source. However, it doesn’t provide an equivalent to OracleJsonObject.get(), so you must decode and encode to a Python dict, similar to BSON.

I ran the following program to measure the time and size to encode and decode BSON and OSON:


import time    
import bson    
import oracledb    
from bson.codec_options import CodecOptions    
from bson.raw_bson import RawBSONDocument    

# Prepare RawBSON codec options once for lazy BSON decoding    
raw_codec_options = CodecOptions(document_class=RawBSONDocument)    

def generate_large_document(num_fields, field_length):    
    long_str = "a" * field_length    
    return {f"field_{i+1}": long_str for i in range(num_fields)}    

def compare_bson_oson(document, connection, iterations=100):    
    """Compare BSON and OSON encode/decode, plus access after decode."""    

    middle_field_name = f"field_{len(document)//2}"    

    # Precompute sizes    
    bson_data_sample = bson.encode(document)    
    bson_size = len(bson_data_sample)    

    oson_data_sample = connection.encode_oson(document)    
    oson_size = len(oson_data_sample)    

    # Timers    
    bson_encode_total = 0.0    
    bson_decode_total = 0.0    
    bson_access_after_decode_total = 0.0    
    bson_decode_raw_total = 0.0    
    bson_access_raw_total = 0.0    

    oson_encode_total = 0.0    
    oson_decode_total = 0.0    
    oson_access_after_decode_total = 0.0    

    for _ in range(iterations):    
        # BSON encode    
        start = time.perf_counter()    
        bson_data = bson.encode(document)    
        bson_encode_total += (time.perf_counter() - start)    

        # BSON decode raw (construct RawBSONDocument)    
        start = time.perf_counter()    
        raw_bson_doc = RawBSONDocument(bson_data, codec_options=raw_codec_options)    
        bson_decode_raw_total += (time.perf_counter() - start)    

        # BSON access single field from raw doc    
        start = time.perf_counter()    
        _ = raw_bson_doc[middle_field_name]    
        bson_access_raw_total += (time.perf_counter() - start)    

        # BSON full decode    
        start = time.perf_counter()    
        decoded_bson = bson.decode(bson_data)    
        bson_decode_total += (time.perf_counter() - start)    

        # BSON access after full decode    
        start = time.perf_counter()    
        _ = decoded_bson[middle_field_name]    
        bson_access_after_decode_total += (time.perf_counter() - start)    

        # OSON encode    
        start = time.perf_counter()    
        oson_data = connection.encode_oson(document)    
        oson_encode_total += (time.perf_counter() - start)    

        # OSON full decode    
        start = time.perf_counter()    
        decoded_oson = connection.decode_oson(oson_data)    
        oson_decode_total += (time.perf_counter() - start)    

        # OSON access after full decode    
        start = time.perf_counter()    
        _ = decoded_oson[middle_field_name]    
        oson_access_after_decode_total += (time.perf_counter() - start)    

    return (    
        bson_encode_total,    
        bson_decode_total,    
        bson_access_after_decode_total,    
        bson_size,    
        bson_decode_raw_total,    
        bson_access_raw_total    
    ), (    
        oson_encode_total,    
        oson_decode_total,    
        oson_access_after_decode_total,    
        oson_size    
    )    

def run_multiple_comparisons():    
    iterations = 100    
    num_fields_list = [10, 100, 1000]    
    field_sizes_list = [1000, 10000, 100000]  # small length → large length    

    # Init Oracle client    
    oracledb.init_oracle_client(config_dir="/home/opc/Wallet")    
    connection = oracledb.connect(    
        user="franck",    
        password="My Strong P455w0rd",    
        dsn="orcl_tp"    
    )    

    print(f"{'Format':<6} {'Fields':<8} {'FieldLen':<10} "    
          f"{'Encode(s)':<12} {'Decode(s)':<12} {'Access(s)':<12} {'Size(bytes)':<12} "    
          f"{'DecRaw(s)':<12} {'AccRaw(s)':<12} "    
          f"{'EncRatio':<9} {'DecRatio':<9} {'SizeRatio':<9}")    
    print("-" * 130)    

    for field_length in field_sizes_list:    
        for num_fields in num_fields_list:    
            document = generate_large_document(num_fields, field_length)    
            bson_res, oson_res = compare_bson_oson(document, connection, iterations)    

            enc_ratio = oson_res[0] / bson_res[0] if bson_res[0] > 0 else 0    
            dec_ratio = oson_res[1] / bson_res[1] if bson_res[1] > 0 else 0    
            size_ratio = oson_res[3] / bson_res[3] if bson_res[3] > 0 else 0    

            # BSON row    
            print(f"{'BSON':<6} {num_fields:<8} {field_length:<10} "    
                  f"{bson_res[0]:<12.4f} {bson_res[1]:<12.4f} {bson_res[2]:<12.4f} {bson_res[3]:<12} "    
                  f"{bson_res[4]:<12.4f} {bson_res[5]:<12.4f} "    
                  f"{'-':<9} {'-':<9} {'-':<9}")    

            # OSON row    
            print(f"{'OSON':<6} {num_fields:<8} {field_length:<10} "    
                  f"{oson_res[0]:<12.4f} {oson_res[1]:<12.4f} {oson_res[2]:<12.4f} {oson_res[3]:<12} "    
                  f"{'-':<12} {'-':<12} "    
                  f"{enc_ratio:<9.2f} {dec_ratio:<9.2f} {size_ratio:<9.2f}")    

    connection.close()    

if __name__ == "__main__":    
    run_multiple_comparisons()    

I got the following results:

  • Encode(s) — total encode time over 100 iterations.
  • Decode(s) — full decode into Python objects (dict).
  • Access(s) — access to a field from Python objects (dict).
  • DecRaw(s) — creation of a RawBSONDocument for BSON (no equivalent for OSON).
  • AccRaw(s) — single middle‑field access from a raw document (lazy decode for BSON).
  • Ratios — OSON time / BSON time.

$ TNS_ADMIN=/home/opc/Wallet python bson-oson.py  
Format Fields   FieldLen   Encode(s)    Decode(s)    Access(s)    Size(bytes)  DecRaw(s)    AccRaw(s)    EncRatio  DecRatio  SizeRatio  
----------------------------------------------------------------------------------------------------------------------------------  
BSON   10       1000       0.0005       0.0005       0.0000       10146        0.0002       0.0011       -         -         -  
OSON   10       1000       0.0013       0.0006       0.0000       10206        -            -            2.44      1.26      1.01  
BSON   100      1000       0.0040       0.0043       0.0000       101497       0.0012       0.0101       -         -         -  
OSON   100      1000       0.0103       0.0053       0.0000       102009       -            -            2.58      1.25      1.01  
BSON   1000     1000       0.0422       0.0510       0.0000       1015898      0.0098       0.0990       -         -         -  
OSON   1000     1000       0.1900       0.0637       0.0000       1021912      -            -            4.50      1.25      1.01  
BSON   10       10000      0.0019       0.0017       0.0000       100146       0.0005       0.0025       -         -         -  
OSON   10       10000      0.0045       0.0021       0.0000       100208       -            -            2.36      1.27      1.00  
BSON   100      10000      0.0187       0.0177       0.0000       1001497      0.0026       0.0225       -         -         -  
OSON   100      10000      0.1247       0.0241       0.0000       1002009      -            -            6.66      1.36      1.00  
BSON   1000     10000      0.2709       0.2439       0.0001       10015898     0.0235       0.2861       -         -         -  
OSON   1000     10000      14.4215      0.3185       0.0001       10021912     -            -            53.23     1.31      1.00  

Important nuance: In Python, oracledb.decode_oson() yields a standard dict, so we cannot measure lazy access as we can with the Java driver’s OracleJsonObject.get() method, which can skip object instantiation. We measured it for one field from the raw BSON to show that there is a cost, which is higher than accessing the dict, though still less than a microsecond for large documents. In general, since you store together what is accessed together, it often makes sense to decode to an application object.

Encoding OSON is slower than BSON, especially for large documents, by design—because it computes navigation metadata for faster reads—whereas BSON writes a contiguous field stream. For the largest case, encoding takes ~15 seconds over 100 iterations, which translates to roughly 150 milliseconds per operation.

Decoding BSON is marginally faster than OSON, but the difference is negligible since all decoding times are under a millisecond. OSON’s extra metadata helps mainly when reading a few fields from a large document, as it avoids instantiating an immutable object.

Raw BSON provides faster "decoding" (as it isn’t actually decoded), but slower field access. Still, this difference—less than a millisecond—is negligible except when accessing many fields, in which case you should decode once to a Python dict.

This test illustrates the different design goals of BSON and OSON. I used the Python driver to illustrate what an application does: get a document from a query and manipulate it as an application object. On the server, it is different, and queries may modify a single field in a document. OSON can do that directly on the OSON datatype, as it has all the metadata, whereas BSON is accessed through a mutable BSON object.

AWS Organizations now supports upgrade rollout policy for Amazon Aurora and Amazon RDS automatic minor version upgrades

AWS Organizations now supports an upgrade rollout policy, a new capability that provides a streamlined solution for managing automatic minor version upgrades across your database fleet. This feature supports Amazon Aurora MySQL-Compatible Edition, Amazon Aurora PostgreSQL-Compatible Edition, and the Amazon RDS database engines MySQL, PostgreSQL, MariaDB, SQL Server, Oracle, and Db2. It eliminates the operational overhead of coordinating upgrades across hundreds of resources and accounts while validating changes in less critical environments before reaching production. In this post, we explore how upgrade rollout policy works, its key benefits, and how you can use it to implement a systematic approach to database maintenance across your organization.

January 12, 2026

Using PXC Replication Manager to Auto Manage Both Source and Replica Failover in Galera-Based Environments

In this blog post, we will be discussing the PXC Replication Manager script/tool, which facilitates both source and replica failover when working with multiple PXC clusters across different DCs/networks connected via an asynchronous replication mechanism. Such topologies emerge from requirements like database version upgrades, reporting or streaming for applications, separate disaster recovery or backup solutions, […]

January 10, 2026

Which Document class is best to use in Java to read MongoDB documents?

TL;DR: the answer is in the title, use Document.

BSON is a serialization format, similar to protobuf, designed for efficient document storage on disk and transfer over the network as a byte stream. Instead of scanning and rewriting the entire byte sequence to access its contents (fields, arrays, subdocuments), you work with an in-memory object that exposes methods to read and write fields efficiently. On MongoDB's server side, that's the mutable BSON object. On the client side, the drivers provide a similar API. Here are the five object types that implement the BSON interface in Java:

  • Document is the recommendation for most applications. It provides the best balance of flexibility, ease of use, and functionality.

Only consider the other classes when you have specific requirements:

  • BsonDocument: When you need strict BSON type safety rather than application types.

  • RawBsonDocument: When you need to access the raw bytes of the document rather than field values.

  • JsonObject: When working exclusively with JSON strings, as plain text.

  • BasicDBObject: Only for legacy code migration.

RawBsonDocument is used internally—for example, for client-side encryption and change streams—whereas the other classes can all be used directly in MongoDB driver operations. Your choice mainly impacts how you construct, manipulate, and access document data in your application code. Each of these classes is documented in the MongoDB Java driver API documentation.

I reviewed the documentation and code to better understand their differences, especially when comparing with other databases that have different designs and APIs. In this post I provide a detailed description of these classes: what they are, how they are implemented, what they provide, and when to use them.

Detailed description

Here's a detailed comparison of the five document classes available in the MongoDB Java Driver and when to use each:

Document (org.bson)

Document is a flexible representation of a BSON document that implements Map<String, Object>. It uses a LinkedHashMap<String, Object> internally to maintain insertion order.

Key characteristics:

  • Loosely-typed: Values are stored as Object, allowing you to use standard Java types (String, Integer, Date, etc.)
  • Flexible: Easy to work with dynamically structured documents
  • Map interface: Provides all standard Map operations

Document is the recommended default choice for most applications. Use it when you want a flexible and concise data representation that's easy to work with.

BsonDocument (org.bson)

BsonDocument is a type-safe container for BSON documents that implements Map<String, BsonValue>. It also uses a LinkedHashMap, but stores BsonValue types.

Key characteristics:

  • Type-safe: All values must be wrapped in BSON library types (BsonString, BsonInt32, BsonDocument, etc.)
  • Stricter API: Provides compile-time type safety but requires more verbose code
  • Map interface: Implements Map<String, BsonValue>

Use BsonDocument when you need a type-safe API and want explicit control over BSON types. This is particularly useful when you need to ensure precise type handling or when working with APIs that require BsonDocument.

RawBsonDocument (org.bson)

RawBsonDocument is an immutable BSON document represented using only raw bytes. It stores the BSON document as a byte array without parsing it.

Key characteristics:

  • Immutable: All mutation operations throw UnsupportedOperationException
  • Lazy parsing: Data is only parsed when accessed, making it very efficient for pass-through scenarios, though not for accessing each individual field
  • Memory efficient: Stores raw bytes, avoiding object allocation overhead
  • Can decode to other types: Provides decode() method to convert to other document types when needed

Use RawBsonDocument when you need maximum performance and memory efficiency for whole-document operations. It is particularly useful when reading documents from MongoDB and passing them to another system unchanged, when working with large documents that you don’t need to parse, when building high-performance data pipelines where parsing overhead matters, and when you need an immutable document representation.

JsonObject (org.bson.json)

JsonObject is a wrapper class that holds a JSON object string. It simply stores the JSON as a String.

Key characteristics:

  • Does not implement Map: It's just a string wrapper with validation
  • No parsing required: Avoids conversion to Map structure if you're just working with JSON
  • JSON-focused: Designed for applications that primarily work with JSON strings
  • Supports Extended JSON: Works with MongoDB Extended JSON format

Use JsonObject when you want to work directly with JSON strings and avoid the overhead of converting to and from Map objects. This is ideal for REST APIs that consume and produce JSON, for logging or persisting documents as JSON strings, and for applications that primarily handle JSON and do not require programmatic field-level access.

BasicDBObject (com.mongodb)

BasicDBObject is a legacy BSON object implementation that extends BasicBSONObject and implements the DBObject interface.

Key characteristics:

  • Legacy class: Exists for backward compatibility with older driver versions
  • Does not implement Map: Only implements the DBObject interface, lacking modern Map convenience methods
  • Binary compatibility concerns: Implements an interface rather than extending a class, which can cause compatibility issues

Only use BasicDBObject when migrating from a legacy driver version (pre-3.0). The documentation explicitly recommends avoiding this class for new development due to its limitations.

Conversion Between Types

All of these classes implement the Bson interface, which allows them to be used interchangeably in MongoDB operations (but without the same performance). You can also convert between types:

  • BsonDocument.parse(json) to create from JSON
  • RawBsonDocument.decode(codec) to convert RawBsonDocument to another type
  • Document can be converted to BsonDocument through codec operations
  • JsonObject.getJson() to extract the JSON string

Comparing RawBsonDocument and Document

RawBsonDocument stores BSON as a raw byte array and does not deserialize it in advance. When you access a field, it creates a BsonBinaryReader and scans the document sequentially, reading each field’s type and name until it finds the requested key. Only the matching field is decoded using RawBsonValueHelper.decode, while all other fields are skipped without parsing. For nested documents and arrays, it reads only their sizes and wraps the corresponding byte ranges in new RawBsonDocument or RawBsonArray instances, keeping their contents as raw bytes. This approach provides fast access for a single field lookup, while being memory-efficient and keeping the document immutable, which is ideal for large documents where only a few fields are needed or for documents that are mostly passed through without inspection.

In contrast, Document uses a fully deserialized LinkedHashMap<String, Object>. When a Document is created, all fields are eagerly parsed into Java objects. Field access and containsKey operations are simple HashMap lookups, and the document is fully mutable, supporting standard map operations such as put, remove, and clear. This design consumes more memory but is better suited to small or medium-sized documents, scenarios where many fields are accessed, or cases where the document needs to be modified frequently.

Finally, Document doesn't use RawBsonDocument for parsing or accessing fields since doing so would be inefficient, and the two serve different purposes.

Comparison with Oracle (OSON) and PostgreSQL (JSONB)

Neither Oracle nor PostgreSQL uses BSON; they use OSON and JSONB respectively, so there's no BsonDocument or RawBsonDocument equivalent.

In Oracle’s JDBC driver, the closest equivalent to a Document is OracleJsonObject, one of the OracleJsonValue types, which can be exposed directly as a javax.json.JsonObject or mapped domain object. This API works directly on the underlying OSON bytes without fully parsing the entire document into an intermediate data structure. OSON is more than a raw serialization: it carries its own local dictionary of distinct field names, a sorted array of hash codes for those names, and compact field ID arrays and value offsets, enabling the driver to locate a field in place by binary‑searching the field ID array, and jumping to the right offset.
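
As a toy model of that lookup path in Python (not Oracle's actual OSON layout), a sorted array of field-name hashes plus value offsets turns field access into a binary search and a jump, with no sequential scan of the document:

import bisect

def build(doc):
    # Keep (hash, name, offset, length) entries sorted by hash, plus a flat
    # payload of encoded values; a stand-in for OSON's field ID array and
    # value offsets, not its real encoding.
    payload = b""
    entries = []
    for name, value in doc.items():
        raw = str(value).encode()
        entries.append((hash(name) & 0xFFFFFFFF, name, len(payload), len(raw)))
        payload += raw
    entries.sort()
    return entries, payload

def lookup(entries, payload, field):
    h = hash(field) & 0xFFFFFFFF
    i = bisect.bisect_left(entries, (h,))             # binary search on the hash
    while i < len(entries) and entries[i][0] == h:    # step over hash collisions
        _, name, offset, length = entries[i]
        if name == field:
            return payload[offset:offset + length].decode()  # jump to the value
        i += 1
    return None

entries, payload = build({"country": "CH", "city": "Lausanne", "zip": "1000"})
print(lookup(entries, payload, "city"))   # Lausanne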

If JSON text is needed instead of the object model, the equivalent is simply to use ResultSet.getString(), which will convert the OSON image to JSON text on the client.

PostgreSQL’s JDBC driver, by contrast, offers no native Java JSON object API for either json or jsonb columns: values are always returned as text, so the application must parse the string into its own document model using a separate JSON library. Even when using PostgreSQL’s binary JSONB storage, none of the binary efficiency crosses the wire (See JSONB vs. BSON: Tracing PostgreSQL and MongoDB Wire Protocols), and client code still performs a full parse before accessing individual fields.

Conclusion

MongoDB’s main advantage for modern applications—whatever the data types or workloads—is the ability to work with data directly through your domain objects, without an intermediate object-relational mapping layer or view. Use the Document class as your document object model (DOM). It offers flexibility, map-style access, and a natural Java interface, while the driver transparently converts BSON from the network into objects your application can use immediately.

January 09, 2026

Cloudspecs: Cloud Hardware Evolution Through the Looking Glass

This paper (CIDR'26) presents a comprehensive analysis of cloud hardware trends from 2015 to 2025, focusing on AWS and comparing it with other clouds and on-premise hardware.

TL;DR: While network bandwidth per dollar improved by one order of magnitude (10x), CPU and DRAM gains have been much more modest. Most surprisingly, NVMe storage performance in the cloud has stagnated since 2016. Check out the NVMe SSD discussion below for data on this anomaly.


CPU Trends

Multi-core parallelism has skyrocketed in the cloud. Maximum core counts have increased by an order of magnitude over the last decade. The largest AWS instance u7in now boasts 448 cores. However, simply adding cores hasn't translated linearly into value. To measure real evolution, the authors normalized benchmarks (SPECint, TPC-H, TPC-C) by instance cost. SPECint benchmarking shows that cost-performance improved roughly 3x over ten years. A huge chunk of that gain comes from AWS Graviton. Without Graviton, the gain drops to roughly 2x. For in-memory database benchmarks, gains were even lower (2x–2.5x), likely due to memory and cache latency bottlenecks.

On-prem hardware comparison shows that this stagnation is not cloud price gouging. Historically, Moore's Law and Dennard scaling doubled cost-performance every two years (which would have compounded to a 32x gain over a decade). However, an analysis of on-premise AMD server CPUs reveals a similar slump: only a 1.7x gain from 2017 to 2025.


Memory Trends

DRAM capacity per dollar has effectively flatlined. The only significant improvement was the 2016 introduction of memory-optimized x instances, which offered ~3.3x more GiB-hours/$ than compute-optimized peers. While absolute single-socket bandwidth jumped ~5x (93 GiB/s to 492 GiB/s) as servers moved from DDR3 to DDR5, the cost-normalized gain is only 2x.


Historical data suggests commodity DRAM prices dropped 3x over the decade. But in the last three months, due to AI-driven demand, DDR5 prices rose sharply, further limiting effective memory gains.


Network Trends

We have good news here, finally. Network bandwidth per dollar exploded by 10x. And absolute speeds went from 10 Gbit/s to 600 Gbit/s (60x).


These gains were not universal though. Generic instances saw little change. The gains were driven by network-optimized n instances (starting with the c5n in 2018) powered by proprietary Nitro cards.


NVMe Trends

NVMe SSDs are the biggest surprise. Unlike CPUs and memory, where cloud trends mirror on-prem hardware, NVMe performance in AWS has largely stagnated. The first NVMe-backed instance family, i3, appeared in 2016. As of 2025, AWS offers 36 NVMe instance families. Yet the i3 still delivers the best I/O performance per dollar by nearly 2x. 


SSD capacity has stagnated since 2019 and I/O throughput since 2016. This sharply contrasts with on-prem hardware, where SSD performance doubled twice (PCIe 4 and PCIe 5) in the same timeframe. The gap between cloud and on-premise NVMe is widening rapidly.

This  price/performance gap likely explains the accelerating push toward disaggregated storage. When local NVMe is expensive and underperforming, remote storage starts to look attractive. The paper speculates that with network speeds exploding and NVMe stagnating, architectures may shift further. For systems like Snowflake, using local NVMe for caching might no longer be worth the complexity compared to reading directly from S3 with fast networks.


Discussion

I think the main takeaway is that uniform hardware scaling in the cloud is over. Moore's Law no longer lifts all boats. Performance gains now come from specialization, especially networking (e.g., Graviton, Nitro, Accelerators).

In my HPTS 2024 review, I noted that contrary to the deafening AI hype, the real excitement in the hallways was about hardware/software codesign. This paper validates that sentiment. With general-purpose CPU and memory cost-performance stagnating, future databases must be tightly integrated with specialized hardware and software capabilities to provide value. I think the findings here will refuel that trend.

A key open question is why massive core counts deliver so little value. Where is the performance lost? Possible explanations include memory bandwidth limits, poor core-to-memory balance, or configuration mismatches. But I think the most likely culprit is software. Parallel programming remains hard, synchronization is expensive, and many systems fail to scale beyond a modest number of cores. We may be leaving significant performance on the table simply because our software cannot effectively utilize the massive parallelism now available.

The paper comes with an interactive tool, Cloudspecs, built on DuckDB-WASM (yay!). This allows you to run SQL queries over the dataset directly in the browser to visualize these trends. The figures in the PDF actually contain clickable link symbols that take you to the specific query used to generate that chart. Awesome reproducibility!

Aleksey and I did a live-reading of the paper. As usual, we had a lot to argue about. I'll add a recording of our discussion on YouTube when it becomes available, and here is a link to my annotated paper.

January 08, 2026

Unlock Amazon Aurora’s Advanced Features with Standard JDBC Driver using AWS Advanced JDBC Wrapper

In this post, we show how you can enhance your Java application with the cloud-based capabilities of Amazon Aurora by using the JDBC Wrapper. Simple code changes shared in this post can transform a standard JDBC application to use fast failover, read/write splitting, IAM authentication, AWS Secrets Manager integration, and federated authentication.

Postgres vs tproc-c on a small server

This is my first post with results from tproc-c using HammerDB. This post has results for Postgres. 

tl;dr - across 8 workloads (low and medium concurrency, cached database to IO-bound)

  • there might be a regression for Postgres 14.20 and 15.15 in one workload
  • there are improvements, some big, for Postgres 17 and 18 in most workloads

Builds, configuration and hardware

I compiled Postgres from source using -O2 -fno-omit-frame-pointer for versions 12.22, 13.23, 14.20, 15.15, 16.11, 17.7 and 18.1.

The server is an ASUS ExpertCenter PN53 with an AMD Ryzen 7 7735HS CPU, 8 cores, SMT disabled, and 32G of RAM. Storage is one NVMe device for the database using ext-4 with discard enabled. The OS is Ubuntu 24.04. More details on it are here.

For versions prior to 18, the config file is named conf.diff.cx10a_c8r32; the files are as similar as possible across versions and are here for versions 12, 13, 14, 15, 16 and 17.

For Postgres 18 the config file is named conf.diff.cx10b_c8r32 and adds io_method='sync', which matches the behavior of earlier Postgres versions.

Benchmark

The benchmark is tproc-c from HammerDB. The tproc-c benchmark is derived from TPC-C.

The benchmark was run for several workloads:
  • vu=1, w=100 - 1 virtual user, 100 warehouses
  • vu=6, w=100 - 6 virtual users, 100 warehouses
  • vu=1, w=1000 - 1 virtual user, 1000 warehouses
  • vu=6, w=1000 - 6 virtual users, 1000 warehouses
  • vu=1, w=2000 - 1 virtual user, 2000 warehouses
  • vu=6, w=2000 - 6 virtual users, 2000 warehouses
  • vu=1, w=4000 - 1 virtual user, 4000 warehouses
  • vu=6, w=4000 - 6 virtual users, 4000 warehouses
The benchmark is run by this script which depends on scripts here.
  • stored procedures are enabled
  • partitioning is used when the warehouse count is >= 1000
  • a 5 minute rampup is used
  • then performance is measured for 120 minutes

Results

All artifacts from the tests are here. A spreadsheet with the charts and numbers is here.

My analysis at this point is simple -- I only consider average throughput. Eventually I will examine throughput over time and efficiency (CPU and IO).

On the charts that follow, the y-axis starts at 0.9 to improve readability. The y-axis shows relative throughput. There might be a regression when the relative throughput is less than 1.0. There might be an improvement when it is > 1.0. The relative throughput is:
(NOPM for a given version / NOPM for Postgres 12.22)

Results: vu=1, w=100

Summary:
  • no regressions, no improvements

Results: vu=6, w=100

Summary:
  • no regressions, no improvements

Results: vu=1, w=1000

Summary:
  • no regressions, improvements in Postgres 17 and 18

Results: vu=6, w=1000

Summary:
  • no regressions, improvements in Postgres 16, 17 and 18

Results: vu=1, w=2000

Summary:
  • possible regressions in Postgres 14 and 15, improvements in 13, 16, 17 and 18

Results: vu=6, w=2000

Summary:
  • no regressions, improvements in Postgres 13 through 18

Results: vu=1, w=4000

Summary:
  • no regressions, improvements in Postgres 13 through 18

Results: vu=6, w=4000

Summary:
  • no regressions, improvements in Postgres 16 through 18

Introducing TRAE SOLO integration with Supabase

TRAE IDE now integrates with Supabase, enabling developers to manage their entire backend infrastructure without leaving the IDE. Browse databases, manage storage, configure auth, and more.

The Sauna Algorithm: Surviving Asynchrony Without a Clock

While sweating it out in my gym's sauna recently, I found a neat way to illustrate the happened-before relationship in distributed systems. Imagine I suffer from a medical condition called dyschronometria, which makes me unable to perceive time reliably, such that 10 seconds and 10 minutes feel exactly the same to me.

In this scenario, the sauna lacks a visible clock. I'm flying blind here, yet I want to leave after a healthy session. If I stay too short, I get no health benefits. If I stay too long, I risk passing out on the floor. The question becomes: How do I, a distributed node with no local clock, ensure operating within a safety window in an asynchronous environment?

Thankfully, the sauna has a uniform arrival of people. Every couple of minutes, a new person walks in. These people don't suffer from dyschronometria and they stay for a healthy session, roughly 10 minutes.

My solution is simple: I identify the first person to enter after me, and I leave when he leaves. I don't measure time at all. Instead, I rely solely on the causal relationship between this person's arrival and departure. The presence of this other person acts as a reference, letting me safely synchronize my action in an entirely asynchronous setting.

Murat's Sauna Algorithm

  1. Enter the sauna.
  2. Identify the very next person to enter (let's call him A).
  3. Wait until A gets up to leave.
  4. Exit immediately after A.

By anchoring my exit to the departure of a person who arrived after me, I guarantee that my stay overlaps entirely with his healthy duration, plus the delta between my arrival and his. I have successfully bounded my sauna/execution time using causal links rather than a physical timer.
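
A toy simulation in Python (with made-up numbers) shows the bound: my stay is never shorter than the newcomer's healthy session, and at most one arrival gap longer.

import random

def my_stay(healthy=10, gap=2):
    # People arrive every `gap` minutes and each stays `healthy` minutes.
    my_arrival = random.uniform(0, gap)   # I arrive somewhere between two arrivals
    a_arrival = gap                       # A is the next person to walk in after me
    a_departure = a_arrival + healthy
    return a_departure - my_arrival       # I exit right after A

stays = [my_stay() for _ in range(100_000)]
print(f"min stay: {min(stays):.2f} min, max stay: {max(stays):.2f} min")
# min is never below the 10-minute healthy session; max adds at most the 2-minute gap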

Now, for the distributed systems purists who might object to step 4 (arguing that "exiting immediately" implies measuring a time interval of zero): I can mend this. I exit after Person A leaves, but before the next person leaves. That keeps us in the realm of strictly causal reasoning.

Consider what would go wrong if I anchored my departure to someone already in the sauna when I entered. That person might be near the end of their session, causing me to leave in just a few seconds. In distributed systems, this is akin to reading the state of an outdated node to make a decision, leading to inconsistent outcomes. Anchoring only to events that occur after my entry ensures I account for all relevant causality.

There is, of course, a safe way to rely on the people already inside. I could take a snapshot of the room upon entry, identify everyone present, and refuse to leave until the entire set of them has departed. This guarantees a sufficient duration (assuming at least one person recently arrived), but it requires more memory. Instead of tracking a single reference point (the newcomer), I would have to maintain state for everyone in the sauna.

Let's connect back to distributed systems. In an asynchronous system, nodes rely on causal relationships between events to order them. Just as I watch the newcomer to determine my exit, a node can observe the messages sent by other nodes to determine safe ordering. For example, in a distributed database, a transaction may want to commit only after all causally prior transactions have completed, ensuring consistency without requiring synchronized clocks. Reasoning about happened-before relationships allows asynchronous systems to coordinate safely without synchronized clocks. See discussions on logical clocks, vector clocks, and hybrid logical clocks in my previous posts.

Before anybody gets too concerned, no, I don't actually suffer from dyschronometria. However, back when I was a professor, my PhD students often joked that I operated on logical clocks rather than physical ones.

Also, in reality, the gym's sauna does have a clock visible from the door. I check it, and aim to stay for about 15 minutes. But I still insist on using my sauna algorithm, well at least the first three steps, because I am a competitive SOB. I try to outcompete the average person and, knock on sauna-bench wood, I haven't had to break my rule yet. It would be absolutely catastrophic if everyone started using my algorithm, though. We'd all be waiting for the next person to leave, resulting in a deadlock. Maybe the memory-costly snapshot algorithm is better after all, as it avoids this problem.

Related post: Consistent snapshot analogies