January 15, 2026
Announcing Percona ClusterSync for MongoDB: The Open Source Trail to Freedom
January 14, 2026
Debugging regressions with Postgres in IO-bound sysbench
I explained in this post that there is a possible performance regression for Postgres with IO-bound sysbench. It arrived in Postgres 16 and remains in Postgres 18. I normally run sysbench with a cached database, but I had a spare server so I repeated tests with an IO-bound workload.
The bad news for me is that I need to spend more time explaining the problem. The good news for me is that I learn more about Postgres by doing this. And to be clear, I have yet to explain the regression but this post documents my debugging efforts.
sysbench
This post explains how I use sysbench. Note that modern sysbench is a framework for running benchmarks and it comes with many built-in tests. By framework I mean that it includes much of the plumbing needed to support several DBMS, and tests are written in Lua with support for LuaJIT so the client side of the tests uses less CPU.
The sysbench framework has been widely used in the MySQL community for a long time. Originally it hard-coded one test (or perhaps a few), and in too many cases when people say sysbench they mean the classic (original) sysbench transaction, which is a mix of point queries, range queries and writes. The classic transaction is now implemented by oltp_read_write.lua with help from oltp_common.lua.
The oltp_read_write.lua test is usually not good at detecting regressions but in this case (the problem motivating me to write this blog post) it has detected the regression.
How to find regressions
I have opinions about how to run DBMS benchmarks. For example, be careful about running read-only benchmarks for an LSM because the state (shape) of the LSM tree can have a large impact on CPU overhead and the LSM state (shape) might be in the same (bad or good) state for the duration of the test. With read-write tests the LSM state should cycle between better and worse states.
But traditional b-tree storage engines also have states. One aspect of that is MVCC debt -- write-heavy tests increase the debt and the engine doesn't always do a great job of limiting that debt. So I run sysbench tests in a certain order to both create and manage MVCC debt while trying to minimize noise.
The order of the tests is listed here and the general sequence is:
- Load and index the tables
- Run a few read-only tests to let MVCC debt get reduced if it exists
- Optionally run more read-only tests (I usually skip these)
- Run write-heavy tests
- Do things to reduce MVCC debt (see here for Postgres and MySQL; a sketch for Postgres follows after this list)
- Run read-only tests (after the tables+indexes have been subject to random writes)
- Run delete-only and then insert-only tests
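For Postgres, one of the things that reduces MVCC debt is running VACUUM (and optionally ANALYZE) between test steps. Below is a minimal sketch of that idea, assuming psycopg2 and the default sysbench table names (sbtest1 ... sbtestN); it is not my actual helper script.

import psycopg2

def vacuum_sysbench_tables(dsn: str, num_tables: int) -> None:
    """Run VACUUM ANALYZE on each sysbench table to reduce MVCC debt."""
    conn = psycopg2.connect(dsn)
    conn.autocommit = True  # VACUUM cannot run inside a transaction block
    with conn.cursor() as cur:
        for i in range(1, num_tables + 1):
            cur.execute(f"VACUUM (ANALYZE) sbtest{i}")
    conn.close()

vacuum_sysbench_tables("dbname=sbtest user=postgres", num_tables=8)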
The regression reproduces for these tests:
- update-zipf in Postgres 16, 17 and 18
- write-only in Postgres 16, 17 and 18
- read-write in Postgres 16, 17 and 18
- insert in Postgres 18 - this reproduces in 18.0 and 18.1
I wrote useful but messy Bash scripts to make it easier to run and explain sysbench results. One of the things I do is collect results from vmstat and iostat per test and then summarize average and normalized values from them where normalized values are: (avg from iostat or vmstat / QPS).
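The normalization itself is just a division by throughput. Here is a minimal sketch of the idea with made-up numbers, not my actual scripts or a real vmstat/iostat parser:

def normalize(averages: dict, qps: float) -> dict:
    """Divide per-second averages from vmstat/iostat by QPS to get
    per-operation values such as cpu/o, cs/o, r/o, rKB/o and wKB/o."""
    return {f"{name}/o": value / qps for name, value in averages.items()}

# hypothetical per-second averages for one test:
# vmstat us+sy, vmstat cs, iostat r/s, rKB/s, wKB/s
averages = {"cpu": 620.0, "cs": 41000.0, "r": 5200.0, "rKB": 81000.0, "wKB": 45000.0}
print(normalize(averages, qps=3500.0))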
From the results below I see that from Postgres 17.7 to 18.1
- throughput decreases by ~20% in 18.1
- cpu/o increased by ~17% in 18.1
- cs/o increased by ~2.6% in 18.1
- r/o increased by ~14% in 18.1
- rKB/o increased by ~22% in 18.1
- wKB/o increased by ~27% in 18.1
* cpu/o - CPU per operation; this includes us and sy from vmstat
* cs/o - context switches per operation (cs from vmstat / QPS)
* r/o - storage reads per operation (r/s from iostat / QPS)
* rKB/o - KB read from storage per operation (rKB/s from iostat / QPS)
* wKB/o - KB written to storage per operation (wKB/s from iostat / QPS)
The columns with a non-zero difference for Postgres 17.7 are:
- There is no regression for write-only and read-write when I change the test order so these run prior to update-one and update-zipf
- There is a regression if either or both of update-one and update-zipf are run prior to write-only and read-write
Note that the update-one and update-zipf tests are a bit awkward. For the update-one test all updates are limited to one row per table (the first row in the table). And for the update-zipf test a zipfian distribution is used to select the rows to update. So in both cases a small number of rows receive most of the updates.
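To illustrate the skew, here is a rough sketch of a zipf-like selection. It is only an approximation with a hypothetical exponent, not sysbench's exact implementation:

import random
from collections import Counter

def zipf_like_row_ids(num_rows: int, theta: float, samples: int):
    """Pick row ids with probability proportional to 1/rank**theta
    (an approximation of a zipfian distribution)."""
    weights = [1.0 / (rank ** theta) for rank in range(1, num_rows + 1)]
    return random.choices(range(1, num_rows + 1), weights=weights, k=samples)

picks = zipf_like_row_ids(num_rows=1_000_000, theta=1.1, samples=100_000)
top10 = Counter(picks).most_common(10)
share = sum(count for _, count in top10) / len(picks)
print(f"top 10 rows receive {share:.0%} of the updates")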
Results from pg_stat_all_tables collected immediately prior to update-one and then after update-zipf are here for Postgres 15.15 and 16.11. Then I computed the difference as (results after update-zipf - results prior to update-one).
The Importance of Realistic Benchmark Workloads
Database Transactions
January 13, 2026
BSON vs OSON: Different design goals
There was a discussion on LinkedIn comparing OSON (the binary JSON in Oracle Database) and BSON (the binary JSON in MongoDB). To be clear, BSON and OSON aren’t directly comparable because they serve different purposes:
- BSON is a general-purpose binary serialization format designed for transport and storage efficiency, similar to Protobuf. It prioritizes fast encode/decode operations for network transmission and disk storage across diverse applications.
- OSON is Oracle Database's internal JSON storage format, specifically engineered for database operations, with extensive metadata to enable efficient field querying and navigation without fully decoding the document.
They share two objectives but make different trade-offs:
- Compactness Through Binary Encoding – Both formats achieve compactness without compression through binary encoding and type-specific representations. However, BSON prioritizes encoding speed with a straightforward binary representation, whereas OSON achieves greater compactness through local dictionary compression of field names within documents.
- Partial Navigation Without Full Scanning – Both enable efficient partial document traversal, but with different approaches. BSON uses simple size prefixes to enable jumping between fields. A BSON document is meant to be stored as a single variable-size block (possible in the WiredTiger B-Tree). OSON implements comprehensive metadata structures—such as a tree segment and jumpable offsets—so that it can reduce random reads when stored in multiple fixed-size blocks (the Oracle Database storage model).
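To make the size-prefix idea concrete, here is a small sketch (my own illustration, not driver code) that walks the top level of a BSON byte string produced by the PyMongo bson package and skips over values using their sizes; only a few BSON types are handled:

import struct
import bson  # PyMongo's bson package

def find_top_level_field(data: bytes, target: str):
    """Return the raw value bytes of one top-level field, skipping the
    other fields via their size prefixes instead of decoding them."""
    pos = 4                      # skip the int32 total document size
    end = len(data) - 1          # last byte is the 0x00 document terminator
    while pos < end:
        element_type = data[pos]
        pos += 1
        name_end = data.index(b"\x00", pos)
        name = data[pos:name_end].decode("utf-8")
        pos = name_end + 1
        if element_type == 0x02:                    # UTF-8 string: int32 length + bytes
            str_len = struct.unpack_from("<i", data, pos)[0]
            value_end = pos + 4 + str_len
        elif element_type in (0x03, 0x04):          # embedded document or array
            doc_len = struct.unpack_from("<i", data, pos)[0]
            value_end = pos + doc_len
        elif element_type == 0x10:                  # int32
            value_end = pos + 4
        elif element_type in (0x01, 0x12, 0x09):    # double, int64, UTC datetime
            value_end = pos + 8
        elif element_type == 0x08:                  # boolean
            value_end = pos + 1
        else:
            raise NotImplementedError(f"type 0x{element_type:02x} not handled in this sketch")
        if name == target:
            return data[pos:value_end]              # raw bytes, decoded only if needed
        pos = value_end                             # jump over the value without decoding it
    return None

doc = {"a": 1, "b": {"nested": "x" * 1000}, "c": "hello"}
raw = bson.encode(doc)
print(find_top_level_field(raw, "c"))  # length prefix followed by b"hello\x00"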
They differ in one major design objective regarding updates:
- BSON follows MongoDB's "store together what is accessed together" principle, optimizing for full-document read patterns and full-document writes to disk as a single contiguous block, rather than in-place modifications. Standard BSON documents are not designed for efficient partial updates outside of an in-memory mutable object representation.
- OSON is designed to optimize database operations, including in-place modifications without additional memory allocation.
I mentioned the Java driver in the previous post. The Oracle Java driver supports fast access via OracleJsonObject.get(), which avoids instantiating a new object and uses the internal metadata for navigation. I wanted to see how this works, but because the Oracle Database Java driver isn’t open source, I tried Python instead, where the Oracle driver is open source. However, it doesn’t provide an equivalent to OracleJsonObject.get(), so you must decode and encode to a Python dict, similar to BSON.
I ran the following program to measure the time and size to encode and decode BSON and OSON:
import time
import bson
import oracledb
from bson.codec_options import CodecOptions
from bson.raw_bson import RawBSONDocument

# Prepare RawBSON codec options once for lazy BSON decoding
raw_codec_options = CodecOptions(document_class=RawBSONDocument)

def generate_large_document(num_fields, field_length):
    long_str = "a" * field_length
    return {f"field_{i+1}": long_str for i in range(num_fields)}

def compare_bson_oson(document, connection, iterations=100):
    """Compare BSON and OSON encode/decode, plus access after decode."""
    middle_field_name = f"field_{len(document)//2}"
    # Precompute sizes
    bson_data_sample = bson.encode(document)
    bson_size = len(bson_data_sample)
    oson_data_sample = connection.encode_oson(document)
    oson_size = len(oson_data_sample)
    # Timers
    bson_encode_total = 0.0
    bson_decode_total = 0.0
    bson_access_after_decode_total = 0.0
    bson_decode_raw_total = 0.0
    bson_access_raw_total = 0.0
    oson_encode_total = 0.0
    oson_decode_total = 0.0
    oson_access_after_decode_total = 0.0
    for _ in range(iterations):
        # BSON encode
        start = time.perf_counter()
        bson_data = bson.encode(document)
        bson_encode_total += (time.perf_counter() - start)
        # BSON decode raw (construct RawBSONDocument)
        start = time.perf_counter()
        raw_bson_doc = RawBSONDocument(bson_data, codec_options=raw_codec_options)
        bson_decode_raw_total += (time.perf_counter() - start)
        # BSON access single field from raw doc
        start = time.perf_counter()
        _ = raw_bson_doc[middle_field_name]
        bson_access_raw_total += (time.perf_counter() - start)
        # BSON full decode
        start = time.perf_counter()
        decoded_bson = bson.decode(bson_data)
        bson_decode_total += (time.perf_counter() - start)
        # BSON access after full decode
        start = time.perf_counter()
        _ = decoded_bson[middle_field_name]
        bson_access_after_decode_total += (time.perf_counter() - start)
        # OSON encode
        start = time.perf_counter()
        oson_data = connection.encode_oson(document)
        oson_encode_total += (time.perf_counter() - start)
        # OSON full decode
        start = time.perf_counter()
        decoded_oson = connection.decode_oson(oson_data)
        oson_decode_total += (time.perf_counter() - start)
        # OSON access after full decode
        start = time.perf_counter()
        _ = decoded_oson[middle_field_name]
        oson_access_after_decode_total += (time.perf_counter() - start)
    return (
        bson_encode_total,
        bson_decode_total,
        bson_access_after_decode_total,
        bson_size,
        bson_decode_raw_total,
        bson_access_raw_total
    ), (
        oson_encode_total,
        oson_decode_total,
        oson_access_after_decode_total,
        oson_size
    )

def run_multiple_comparisons():
    iterations = 100
    num_fields_list = [10, 100, 1000]
    field_sizes_list = [1000, 10000, 100000]  # small length → large length
    # Init Oracle client
    oracledb.init_oracle_client(config_dir="/home/opc/Wallet")
    connection = oracledb.connect(
        user="franck",
        password="My Strong P455w0rd",
        dsn="orcl_tp"
    )
    print(f"{'Format':<6} {'Fields':<8} {'FieldLen':<10} "
          f"{'Encode(s)':<12} {'Decode(s)':<12} {'Access(s)':<12} {'Size(bytes)':<12} "
          f"{'DecRaw(s)':<12} {'AccRaw(s)':<12} "
          f"{'EncRatio':<9} {'DecRatio':<9} {'SizeRatio':<9}")
    print("-" * 130)
    for field_length in field_sizes_list:
        for num_fields in num_fields_list:
            document = generate_large_document(num_fields, field_length)
            bson_res, oson_res = compare_bson_oson(document, connection, iterations)
            enc_ratio = oson_res[0] / bson_res[0] if bson_res[0] > 0 else 0
            dec_ratio = oson_res[1] / bson_res[1] if bson_res[1] > 0 else 0
            size_ratio = oson_res[3] / bson_res[3] if bson_res[3] > 0 else 0
            # BSON row
            print(f"{'BSON':<6} {num_fields:<8} {field_length:<10} "
                  f"{bson_res[0]:<12.4f} {bson_res[1]:<12.4f} {bson_res[2]:<12.4f} {bson_res[3]:<12} "
                  f"{bson_res[4]:<12.4f} {bson_res[5]:<12.4f} "
                  f"{'-':<9} {'-':<9} {'-':<9}")
            # OSON row
            print(f"{'OSON':<6} {num_fields:<8} {field_length:<10} "
                  f"{oson_res[0]:<12.4f} {oson_res[1]:<12.4f} {oson_res[2]:<12.4f} {oson_res[3]:<12} "
                  f"{'-':<12} {'-':<12} "
                  f"{enc_ratio:<9.2f} {dec_ratio:<9.2f} {size_ratio:<9.2f}")
    connection.close()

if __name__ == "__main__":
    run_multiple_comparisons()
I got the following results:
- Encode(s): total encode time over 100 iterations.
- Decode(s): full decode into Python objects (dict).
- Access(s): access to a field from Python objects (dict).
- DecRaw(s): creation of a RawBSONDocument for BSON (no equivalent for OSON).
- AccRaw(s): single middle-field access from a raw document (lazy decode for BSON).
- Ratios: OSON time / BSON time.
$ TNS_ADMIN=/home/opc/Wallet python bson-oson.py
Format Fields FieldLen Encode(s) Decode(s) Access(s) Size(bytes) DecRaw(s) AccRaw(s) EncRatio DecRatio SizeRatio
----------------------------------------------------------------------------------------------------------------------------------
BSON 10 1000 0.0005 0.0005 0.0000 10146 0.0002 0.0011 - - -
OSON 10 1000 0.0013 0.0006 0.0000 10206 - - 2.44 1.26 1.01
BSON 100 1000 0.0040 0.0043 0.0000 101497 0.0012 0.0101 - - -
OSON 100 1000 0.0103 0.0053 0.0000 102009 - - 2.58 1.25 1.01
BSON 1000 1000 0.0422 0.0510 0.0000 1015898 0.0098 0.0990 - - -
OSON 1000 1000 0.1900 0.0637 0.0000 1021912 - - 4.50 1.25 1.01
BSON 10 10000 0.0019 0.0017 0.0000 100146 0.0005 0.0025 - - -
OSON 10 10000 0.0045 0.0021 0.0000 100208 - - 2.36 1.27 1.00
BSON 100 10000 0.0187 0.0177 0.0000 1001497 0.0026 0.0225 - - -
OSON 100 10000 0.1247 0.0241 0.0000 1002009 - - 6.66 1.36 1.00
BSON 1000 10000 0.2709 0.2439 0.0001 10015898 0.0235 0.2861 - - -
OSON 1000 10000 14.4215 0.3185 0.0001 10021912 - - 53.23 1.31 1.00
Important nuance: In Python, oracledb.decode_oson() yields a standard dict, so we cannot measure lazy access as we can with the Java driver’s OracleJsonObject.get() method, which can skip object instantiation. We measured it for one field from the raw BSON to show that there is a cost, which is higher than accessing the dict, though still less than a microsecond for large documents. In general, since you store together what is accessed together, it often makes sense to decode to an application object.
Encoding OSON is slower than BSON, especially for large documents, and that is by design: OSON computes navigation metadata for faster reads, whereas BSON writes a contiguous field stream. For the largest case, encoding takes ~14.4 seconds over 100 iterations, or roughly 145 milliseconds per operation.
Decoding BSON is marginally faster than OSON, but the difference is negligible since all decoding times are under a millisecond. OSON’s extra metadata helps mainly when reading a few fields from a large document, as it avoids instantiating an immutable object.
Raw BSON provides faster "decoding" (as it isn’t actually decoded), but slower field access. Still, this difference—less than a millisecond—is negligible except when accessing many fields, in which case you should decode once to a Python dict.
This test illustrates the different design goals of BSON and OSON. I used the Python driver to illustrate what an application does: get a document from a query and manipulate it as an application object. On the server, it is different, and queries may modify a single field in a document. OSON does this directly on the OSON datatype, since it has all the metadata, whereas BSON is accessed through a mutable BSON object.
AWS Organizations now supports upgrade rollout policy for Amazon Aurora and Amazon RDS automatic minor version upgrades
January 12, 2026
Using PXC Replication Manager to Auto Manage Both Source and Replica Failover in Galera-Based Environments
January 10, 2026
Which Document class is best to use in Java to read MongoDB documents?
TL;DR: the answer is in the title, use Document.
BSON is a serialization format, similar to protobuf, designed for efficient document storage on disk and transfer over the network as a byte stream. Instead of scanning and rewriting the entire byte sequence to access its contents (fields, arrays, subdocuments), you work with an in-memory object that exposes methods to read and write fields efficiently. On MongoDB's server side, that's the mutable BSON object. On the client side, the drivers provide a similar API. Here are the five object types that implement the Bson interface in Java:
- Document is the recommendation for most applications. It provides the best balance of flexibility, ease of use, and functionality.
Only consider the other classes when you have specific requirements:
BsonDocument: When you need strict BSON type safety rather than application types.
RawBsonDocument: When you need to access the raw bytes of the document rather than field values.
JsonObject: When working exclusively with JSON strings, as plain text.
BasicDBObject: Only for legacy code migration.
RawBsonDocument is used internally—for example, for client-side encryption and change streams—whereas the other classes can all be used directly in MongoDB driver operations. Your choice mainly impacts how you construct, manipulate, and access document data in your application code. Each of them is documented in the MongoDB Java driver documentation.
I reviewed the documentation and code to better understand their differences, especially when comparing with other databases that have different designs and APIs. In this post I provide a detailed description of those classes: what they are, how they are implemented, what they provide, and when to use them.
Detailed description
Here's a detailed comparison of the five document classes available in the MongoDB Java Driver and when to use each:
Document (org.bson)
Document is a flexible representation of a BSON document that implements Map<String, Object>. It uses a LinkedHashMap<String, Object> internally to maintain insertion order.
Key characteristics:
- Loosely-typed: values are stored as Object, allowing you to use standard Java types (String, Integer, Date, etc.)
- Flexible: easy to work with dynamically structured documents
- Map interface: provides all standard Map operations
Document is the recommended default choice for most applications. Use it when you want a flexible and concise data representation that's easy to work with.
BsonDocument (org.bson)
BsonDocument is a type-safe container for BSON documents that implements Map<String, BsonValue>. It also uses a LinkedHashMap, but stores BsonValue types.
Key characteristics:
- Type-safe: all values must be wrapped in BSON library types (BsonString, BsonInt32, BsonDocument, etc.)
- Stricter API: provides compile-time type safety but requires more verbose code
- Map interface: implements Map<String, BsonValue>
Use BsonDocument when you need a type-safe API and want explicit control over BSON types. This is particularly useful when you need to ensure precise type handling or when working with APIs that require BsonDocument.
RawBsonDocument (org.bson)
RawBsonDocument is an immutable BSON document represented using only raw bytes. It stores the BSON document as a byte array without parsing it.
Key characteristics:
- Immutable: all mutation operations throw UnsupportedOperationException
- Lazy parsing: data is only parsed when accessed, making it very efficient for pass-through scenarios, though not for accessing each individual field
- Memory efficient: stores raw bytes, avoiding object allocation overhead
- Can decode to other types: provides a decode() method to convert to other document types when needed
Use RawBsonDocument when you need maximum performance and memory efficiency for whole-document operations. It is particularly useful when reading documents from MongoDB and passing them to another system unchanged, when working with large documents that you don’t need to parse, when building high-performance data pipelines where parsing overhead matters, and when you need an immutable document representation.
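Although this post is about the Java driver, the same pass-through pattern can be sketched with the Python driver (pymongo), which offers an equivalent RawBSONDocument; the connection string, database, and collection names below are hypothetical:

from bson.codec_options import CodecOptions
from bson.raw_bson import RawBSONDocument
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # hypothetical connection string
coll = client["appdb"]["events"]                   # hypothetical database/collection

# Ask the driver to return documents as RawBSONDocument instead of dict
raw_coll = coll.with_options(
    codec_options=CodecOptions(document_class=RawBSONDocument)
)

for raw_doc in raw_coll.find().limit(10):
    payload = raw_doc.raw                # the BSON bytes, never parsed into Python objects
    # pass the bytes to another system unchanged, e.g. a queue or a file
    print(len(payload), raw_doc["_id"])  # individual field access lazily scans the bytes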
JsonObject (org.bson.json)
JsonObject is a wrapper class that holds a JSON object string. It simply stores the JSON as a String.
Key characteristics:
- Does not implement Map: it's just a string wrapper with validation
- No parsing required: avoids conversion to a Map structure if you're just working with JSON
- JSON-focused: designed for applications that primarily work with JSON strings
- Supports Extended JSON: works with MongoDB Extended JSON format
Use JsonObject when you want to work directly with JSON strings and avoid the overhead of converting to and from Map objects. This is ideal for REST APIs that consume and produce JSON, for logging or persisting documents as JSON strings, and for applications that primarily handle JSON and do not require programmatic field-level access.
BasicDBObject (com.mongodb)
BasicDBObject is a legacy BSON object implementation that extends BasicBSONObject and implements the DBObject interface.
Key characteristics:
- Legacy class: Exists for backward compatibility with older driver versions
- Does not implement Map: only implements the DBObject interface, lacking modern Map convenience methods
- Binary compatibility concerns: implements an interface rather than extending a class, which can cause compatibility issues
Only use BasicDBObject when migrating from a legacy driver version (pre-3.0). The documentation explicitly recommends avoiding this class for new development due to its limitations.
Conversion Between Types
All of these classes implement the Bson interface, which allows them to be used interchangeably in MongoDB operations (but without the same performance). You can also convert between types:
- BsonDocument.parse(json) to create from JSON
- RawBsonDocument.decode(codec) to convert a RawBsonDocument to another type
- Document can be converted to BsonDocument through codec operations
- JsonObject.getJson() to extract the JSON string
Comparing RawBsonDocument and Document
RawBsonDocument stores BSON as a raw byte array and does not deserialize it in advance. When you access a field, it creates a BsonBinaryReader and scans the document sequentially, reading each field’s type and name until it finds the requested key. Only the matching field is decoded using RawBsonValueHelper.decode, while all other fields are skipped without parsing. For nested documents and arrays, it reads only their sizes and wraps the corresponding byte ranges in new RawBsonDocument or RawBsonArray instances, keeping their contents as raw bytes. This approach provides fast access for a single field lookup, while being memory-efficient and keeping the document immutable, which is ideal for large documents where only a few fields are needed or for documents that are mostly passed through without inspection.
In contrast, Document uses a fully deserialized LinkedHashMap<String, Object>. When a Document is created, all fields are eagerly parsed into Java objects. Field access and containsKey operations are simple HashMap lookups, and the document is fully mutable, supporting standard map operations such as put, remove, and clear. This design consumes more memory but is better suited to small or medium-sized documents, scenarios where many fields are accessed, or cases where the document needs to be modified frequently.
Finally, Document doesn't use RawBsonDocument for parsing or accessing fields since doing so would be inefficient, and the two serve different purposes.
Comparison with Oracle (OSON) and PostgreSQL (JSONB)
Neither Oracle nor PostgreSQL provides BSON as they use OSON and JSONB, so there's no BsonDocument or RawBsonDocument equivalent.
In Oracle’s JDBC driver, the closest equivalent to a Document is OracleJsonObject, one of the OracleJsonValue types, which can be exposed directly as a javax.json.JsonObject or mapped domain object. This API works directly on the underlying OSON bytes without fully parsing the entire document into an intermediate data structure. OSON is more than a raw serialization: it carries its own local dictionary of distinct field names, a sorted array of hash codes for those names, and compact field ID arrays and value offsets, enabling the driver to locate a field in place by binary‑searching the field ID array, and jumping to the right offset.
If JSON text is needed instead of the object model, the equivalent is simply to use ResultSet.getString(), which will convert the OSON image to JSON text on the client.
PostgreSQL’s JDBC driver, by contrast, offers no native Java JSON object API for either json or jsonb columns: values are always returned as text, so the application must parse the string into its own document model using a separate JSON library. Even when using PostgreSQL’s binary JSONB storage, none of the binary efficiency crosses the wire (See JSONB vs. BSON: Tracing PostgreSQL and MongoDB Wire Protocols), and client code still performs a full parse before accessing individual fields.
Conclusion
MongoDB’s main advantage for modern applications—whatever the data types or workloads—is the ability to work with data directly through your domain objects, without an intermediate object-relational mapping layer or view. Use the Document class as your document object model (DOM). It offers flexibility, map-style access, and a natural Java interface, while the driver transparently converts BSON from the network into objects your application can use immediately.
January 09, 2026
Cloudspecs: Cloud Hardware Evolution Through the Looking Glass
This paper (CIDR'26) presents a comprehensive analysis of cloud hardware trends from 2015 to 2025, focusing on AWS and comparing it with other clouds and on-premise hardware.
TL;DR: While network bandwidth per dollar improved by one order of magnitude (10x), CPU and DRAM gains have been much more modest. Most surprisingly, NVMe storage performance in the cloud has stagnated since 2016. Check out the NVMe SSD discussion below for data on this anomaly.
CPU Trends
Multi-core parallelism has skyrocketed in the cloud. Maximum core counts have increased by an order of magnitude over the last decade. The largest AWS instance u7in now boasts 448 cores. However, simply adding cores hasn't translated linearly into value. To measure real evolution, the authors normalized benchmarks (SPECint, TPC-H, TPC-C) by instance cost. SPECint benchmarking shows that cost-performance improved roughly 3x over ten years. A huge chunk of that gain comes from AWS Graviton. Without Graviton, the gain drops to roughly 2x. For in-memory database benchmarks, gains were even lower (2x–2.5x), likely due to memory and cache latency bottlenecks.
On-prem hardware comparison shows that this stagnation is not cloud price gouging. Historically, Moore's Law and Dennard scaling doubled cost-performance every two years (which would have compounded to a 32x gain over a decade). However, an analysis of on-premise AMD server CPUs reveals a similar slump: only a 1.7x gain from 2017 to 2025.
Memory Trends
DRAM capacity per dollar has effectively flatlined. The only significant improvement was the 2016 introduction of memory-optimized x instances, which offered ~3.3x more GiB-hours/$ than compute-optimized peers. While absolute single-socket bandwidth jumped ~5x (93 GiB/s to 492 GiB/s) as servers moved from DDR3 to DDR5, the cost-normalized gain is only 2x.
Historical data suggests commodity DRAM prices dropped 3x over the decade. But in the last three months, due to AI-driven demand, DDR5 prices rose sharply, further limiting effective memory gains.
Network Trends
We have good news here, finally. Network bandwidth per dollar exploded by 10x. And absolute speeds went from 10 Gbit/s to 600 Gbit/s (60x).
These gains were not universal though. Generic instances saw little change. The gains were driven by network-optimized n instances (starting with the c5n in 2018) powered by proprietary Nitro cards.
NVMe Trends
NVMe SSDs are the biggest surprise. Unlike CPUs and memory, where cloud trends mirror on-prem hardware, NVMe performance in AWS has largely stagnated. The first NVMe-backed instance family, i3, appeared in 2016. As of 2025, AWS offers 36 NVMe instance families. Yet the i3 still delivers the best I/O performance per dollar by nearly 2x.
SSD capacity has stagnated since 2019 and I/O throughput since 2016. This sharply contrasts with on-prem hardware, where SSD performance doubled twice (PCIe 4 and PCIe 5) in the same timeframe. The gap between cloud and on-premise NVMe is widening rapidly.
This price/performance gap likely explains the accelerating push toward disaggregated storage. When local NVMe is expensive and underperforming, remote storage starts to look attractive. The paper speculates that with network speeds exploding and NVMe stagnating, architectures may shift further. For systems like Snowflake, using local NVMe for caching might no longer be worth the complexity compared to reading directly from S3 with fast networks.
Discussion
I think the main takeaway is that uniform hardware scaling in the cloud is over. Moore's Law no longer lifts all boats. Performance gains now come from specialization, especially networking (e.g., Graviton, Nitro, Accelerators).
In my HPTS 2024 review, I noted that contrary to the deafening AI hype, the real excitement in the hallways was about hardware/software codesign. This paper validates that sentiment. With general-purpose CPU and memory cost-performance stagnating, future databases must be tightly integrated with specialized hardware and software capabilities to provide value. I think the findings here will refuel that trend.
A key open question is why massive core counts deliver so little value. Where is the performance lost? Possible explanations include memory bandwidth limits, poor core-to-memory balance, or configuration mismatches. But I think the most likely culprit is software. Parallel programming remains hard, synchronization is expensive, and many systems fail to scale beyond a modest number of cores. We may be leaving significant performance on the table simply because our software cannot effectively utilize the massive parallelism now available.
The paper comes with an interactive tool, Cloudspecs, built on DuckDB-WASM (yay!). This allows you to run SQL queries over the dataset directly in the browser to visualize these trends. The figures in the PDF actually contain clickable link symbols that take you to the specific query used to generate that chart. Awesome reproducibility!
Aleksey and I did a live-reading of the paper. As usual, we had a lot to argue about. I'll add a recording of our discussion on YouTube when it becomes available, and here is a link to my annotated paper.
Best Practices to Backfill Materialized Views in ClickHouse® Safely
January 08, 2026
Unlock Amazon Aurora’s Advanced Features with Standard JDBC Driver using AWS Advanced JDBC Wrapper
Postgres vs tproc-c on a small server
This is my first post with results from tproc-c using HammerDB. This post has results for Postgres.
tl;dr - across 8 workloads (low and medium concurrency, cached database to IO-bound)
- there might be a regression for Postgres 14.20 and 15.15 in one workload
- there are improvements, some big, for Postgres 17 and 18 in most workloads
The server is an ASUS ExpertCenter PN53 with an AMD Ryzen 7 7735HS CPU, 8 cores, SMT disabled, and 32G of RAM. Storage is one NVMe device for the database using ext-4 with discard enabled. The OS is Ubuntu 24.04. More details on it are here.
Benchmark
The benchmark was run for several workloads:
- vu=1, w=100 - 1 virtual user, 100 warehouses
- vu=6, w=100 - 6 virtual users, 100 warehouses
- vu=1, w=1000 - 1 virtual user, 1000 warehouses
- vu=6, w=1000 - 6 virtual users, 1000 warehouses
- vu=1, w=2000 - 1 virtual user, 2000 warehouses
- vu=6, w=2000 - 6 virtual users, 2000 warehouses
- vu=1, w=4000 - 1 virtual user, 4000 warehouses
- vu=6, w=4000 - 6 virtual users, 4000 warehouses
- stored procedures are enabled
- partitioning is used when the warehouse count is >= 1000
- a 5 minute rampup is used
- then performance is measured for 120 minutes
Results are reported as relative NOPM: (NOPM for a given version / NOPM for Postgres 12.22).
Results, one summary per workload in the order listed above:
- vu=1, w=100: no regressions, no improvements
- vu=6, w=100: no regressions, no improvements
- vu=1, w=1000: no regressions, improvements in Postgres 17 and 18
- vu=6, w=1000: no regressions, improvements in Postgres 16, 17 and 18
- vu=1, w=2000: possible regressions in Postgres 14 and 15, improvements in 13, 16, 17 and 18
- vu=6, w=2000: no regressions, improvements in Postgres 13 through 18
- vu=1, w=4000: no regressions, improvements in Postgres 13 through 18
- vu=6, w=4000: no regressions, improvements in Postgres 16 through 18
Introducing TRAE SOLO integration with Supabase
The Sauna Algorithm: Surviving Asynchrony Without a Clock
While sweating it out in my gym's sauna recently, I found a neat way to illustrate the happened-before relationship in distributed systems. Imagine I suffer from a medical condition called dyschronometria, which makes me unable to perceive time reliably, such that 10 seconds and 10 minutes feel exactly the same to me.
In this scenario, the sauna lacks a visible clock. I'm flying blind here, yet I want to leave after a healthy session. If I stay too short, I get no health benefits. If I stay too long, I risk passing out on the floor. The question becomes: How do I, a distributed node with no local clock, ensure operating within a safety window in an asynchronous environment?
Thankfully, the sauna has a uniform arrival of people. Every couple of minutes, a new person walks in. These people don't suffer from dyschronometria and they stay for a healthy session, roughly 10 minutes.
My solution is simple: I identify the first person to enter after me, and I leave when he leaves. I don't measure time at all. Instead, I rely solely on the causal relationship between this person's arrival and departure. The presence of this other person acts as a reference, letting me safely synchronize my action in an entirely asynchronous setting.
Murat's Sauna Algorithm
- Enter the sauna.
- Identify the very next person to enter (let's call him A).
- Wait until A gets up to leave.
- Exit immediately after A.
By anchoring my exit to the departure of a person who arrived after me, I guarantee that my stay overlaps entirely with his healthy duration, plus the delta between my arrival and his. I have successfully bounded my sauna/execution time using causal links rather than a physical timer.
Now, for the distributed systems purists who might object to step 4 (arguing that "exiting immediately" implies measuring a time interval of zero): I can mend this. I exit after Person A leaves, but before the next person leaves. That keeps us in the realm of strictly causal reasoning.
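For readers who like to see it run, here is a toy simulation of the algorithm. It is my own sketch, and the arrival gap and stay durations are assumptions, not measurements from a real sauna:

import random

def simulate_sauna(trials=10_000, arrival_gap=2.0, healthy_stay=10.0):
    """Simulate the sauna algorithm: I enter at t=0, then leave when the
    first person who arrived after me leaves. Returns min/max of my stay."""
    stays = []
    for _ in range(trials):
        # the next person walks in some time after me (every couple of minutes)
        their_arrival = random.uniform(0.0, arrival_gap)
        # they can perceive time, so they stay roughly 10 healthy minutes
        their_departure = their_arrival + random.uniform(0.9, 1.1) * healthy_stay
        stays.append(their_departure)  # I exit right after them
    return min(stays), max(stays)

lo, hi = simulate_sauna()
print(f"my stay is always between {lo:.1f} and {hi:.1f} minutes")
# My stay is bounded below by the newcomer's healthy duration and above by
# that duration plus the arrival gap -- no clock needed on my side.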
Consider what would go wrong if I anchored my departure to someone already in the sauna when I entered. That person might be near the end of their session, causing me to leave in just a few seconds. In distributed systems, this is akin to reading the state of an outdated node to make a decision, leading to inconsistent outcomes. Anchoring only to events that occur after my entry ensures I account for all relevant causality.
There is, of course, a safe way to rely on the people already inside. I could take a snapshot of the room upon entry, identify everyone present, and refuse to leave until the entire set of them has departed. This guarantees a sufficient duration (assuming at least one person recently arrived), but it requires more memory. Instead of tracking a single reference point (the newcomer), I would have to maintain state for everyone in the sauna.
Let's connect back to distributed systems. In an asynchronous system, nodes rely on causal relationships between events to order them. Just as I watch the newcomer to determine my exit, a node can observe the messages sent by other nodes to determine safe ordering. For example, in a distributed database, a transaction may want to commit only after all causally prior transactions have completed, ensuring consistency without requiring synchronized clocks. Reasoning about happened-before relationships allows asynchronous systems to coordinate safely without synchronized clocks. See discussions on logical clocks, vector clocks, and hybrid logical clocks in my previous posts.
Before anybody gets too concerned, no, I don't actually suffer from dyschronometria. However, back when I was a professor, my PhD students often joked that I operated on logical clocks rather than physical ones.
Also, in reality, the gym's sauna does have a clock visible from the door. I check it, and aim to stay for about 15 minutes. But I still insist on using my sauna algorithm, well at least the first three steps, because I am a competitive SOB. I try to outcompete the average person and, knock on sauna-bench wood, I haven't had to break my rule yet. It would be absolutely catastrophic if everyone started using my algorithm, though. We'd all be waiting for the next person to leave, resulting in a deadlock. Maybe the memory-costly snapshot algorithm is better after all, as it avoids this problem.
Related post: Consistent snapshot analogies