You won’t always have a perfect index for every query, but you may have single-field indexes for each filter. In such cases, PostgreSQL can use a Bitmap Scan to combine these indexes. MongoDB can also merge multiple index bounds in a single scan, or use index intersection to combine separate scans. Yet the MongoDB query planner rarely selects index intersection. Let’s look at the reasons behind this.
TL;DR: if you think you need index intersection, you probably need better compound indexes.
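To make the TL;DR concrete, here is what that looks like for the query used throughout this post. This is a sketch against the demo collection created in the next section; the index name follows MongoDB's default naming convention.
// A compound index on both filter fields serves { a: 42, b: 42 } with a single,
// tight index scan: no intersection and no post-FETCH filter needed.
db.demo.createIndex({ a: 1, b: 1 });
// The winning plan should then be a single IXSCAN on a_1_b_1 with bounds on both
// fields, examining only the keys that match both predicates.
db.demo.find({ a: 42, b: 42 }).explain("executionStats").executionStats;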
Test Setup
I create a collection with one hundred thousand documents and two fields, with an index on each:
let bulk = [];
for (let i = 0; i < 100000; i++) {
  bulk.push({
    a: Math.floor(Math.random()*100),
    b: Math.floor(Math.random()*100)
  });
}
db.demo.insertMany(bulk);
// separate indexes
db.demo.createIndex({ a: 1 });
db.demo.createIndex({ b: 1 });
In PostgreSQL, we’ll mirror the dataset as follows:
CREATE TABLE demo AS
SELECT id,
(random()*100)::int AS a,
(random()*100)::int AS b
FROM generate_series(1,100000) id;
CREATE INDEX demo_a_idx ON demo(a);
CREATE INDEX demo_b_idx ON demo(b);
In my MongoDB collection of 100,000 documents, only nine documents have both the "a" and "b" fields set to 42:
mongo> db.demo.countDocuments()
100000
mongo> db.demo.find({ a: 42, b: 42 }).showRecordId()
[
{ _id: ObjectId('6928697ae5fd2cdba9d53f54'), a: 42, b: 42, '$recordId': Long('36499') },
{ _id: ObjectId('6928697ae5fd2cdba9d54081'), a: 42, b: 42, '$recordId': Long('36800') },
{ _id: ObjectId('6928697ae5fd2cdba9d54a7c'), a: 42, b: 42, '$recordId': Long('39355') },
{ _id: ObjectId('6928697ae5fd2cdba9d55a3e'), a: 42, b: 42, '$recordId': Long('43389') },
{ _id: ObjectId('6928697ae5fd2cdba9d5a214'), a: 42, b: 42, '$recordId': Long('61779') },
{ _id: ObjectId('6928697ae5fd2cdba9d5e52a'), a: 42, b: 42, '$recordId': Long('78953') },
{ _id: ObjectId('6928697ae5fd2cdba9d5eeea'), a: 42, b: 42, '$recordId': Long('81449') },
{ _id: ObjectId('6928697ae5fd2cdba9d61f48'), a: 42, b: 42, '$recordId': Long('93831') },
{ _id: ObjectId('6928697ae5fd2cdba9d61f97'), a: 42, b: 42, '$recordId': Long('93910') }
]
In my PostgreSQL database, there are 100,000 rows and, among them, nine rows have the value 42 in both the "a" and "b" columns:
postgres=# select count(*) from demo;
count
--------
100000
(1 row)
postgres=# SELECT *, ctid FROM demo WHERE a = 42 AND b = 42;
a | b | id | ctid
----+----+-------+-----------
42 | 42 | 4734 | (25,109)
42 | 42 | 15678 | (84,138)
42 | 42 | 29464 | (159,49)
42 | 42 | 29748 | (160,148)
42 | 42 | 31139 | (168,59)
42 | 42 | 37785 | (204,45)
42 | 42 | 55112 | (297,167)
42 | 42 | 85823 | (463,168)
42 | 42 | 88707 | (479,92)
(9 rows)
I displayed the CTID for PostgreSQL and the RecordId for MongoDB to see how the matching rows are distributed over the heap table (PostgreSQL) or the WiredTiger B-Tree (MongoDB).
MongoDB Possible Execution Plans
I have executed db.demo.find({ a: 42, b: 42 }), and multiple plans have been evaluated:
mongo> db.demo.getPlanCache().list();
[
{
version: '1',
queryHash: 'BBC007A6',
planCacheShapeHash: 'BBC007A6',
planCacheKey: '51C56FDD',
isActive: true,
works: Long('968'),
worksType: 'works',
timeOfCreation: ISODate('2025-11-27T15:09:11.069Z'),
createdFromQuery: { query: { a: 42, b: 42 }, sort: {}, projection: {} },
cachedPlan: {
stage: 'FETCH',
filter: { b: { '$eq': 42 } },
inputStage: {
stage: 'IXSCAN',
keyPattern: { a: 1 },
indexName: 'a_1',
isMultiKey: false,
multiKeyPaths: { a: [] },
isUnique: false,
isSparse: false,
isPartial: false,
indexVersion: 2,
direction: 'forward',
indexBounds: { a: [ '[42, 42]' ] }
}
},
creationExecStats: [
{
nReturned: 12,
executionTimeMillisEstimate: 0,
totalKeysExamined: 967,
totalDocsExamined: 967,
executionStages: {
stage: 'FETCH',
filter: { b: { '$eq': 42 } },
nReturned: 12,
executionTimeMillisEstimate: 0,
works: 968,
advanced: 12,
needTime: 955,
needYield: 0,
saveState: 0,
restoreState: 0,
isEOF: 1,
docsExamined: 967,
alreadyHasObj: 0,
inputStage: {
stage: 'IXSCAN',
nReturned: 967,
executionTimeMillisEstimate: 0,
works: 968,
advanced: 967,
needTime: 0,
needYield: 0,
saveState: 0,
restoreState: 0,
isEOF: 1,
keyPattern: { a: 1 },
indexName: 'a_1',
isMultiKey: false,
multiKeyPaths: { a: [] },
isUnique: false,
isSparse: false,
isPartial: false,
indexVersion: 2,
direction: 'forward',
indexBounds: { a: [Array] },
keysExamined: 967,
seeks: 1,
dupsTested: 0,
dupsDropped: 0
}
}
},
{
nReturned: 10,
executionTimeMillisEstimate: 0,
totalKeysExamined: 968,
totalDocsExamined: 968,
executionStages: {
stage: 'FETCH',
filter: { a: { '$eq': 42 } },
nReturned: 10,
executionTimeMillisEstimate: 0,
works: 968,
advanced: 10,
needTime: 958,
needYield: 0,
saveState: 0,
restoreState: 0,
isEOF: 0,
docsExamined: 968,
alreadyHasObj: 0,
inputStage: {
stage: 'IXSCAN',
nReturned: 968,
executionTimeMillisEstimate: 0,
works: 968,
advanced: 968,
needTime: 0,
needYield: 0,
saveState: 0,
restoreState: 0,
isEOF: 0,
keyPattern: { b: 1 },
indexName: 'b_1',
isMultiKey: false,
multiKeyPaths: { b: [] },
isUnique: false,
isSparse: false,
isPartial: false,
indexVersion: 2,
direction: 'forward',
indexBounds: { b: [Array] },
keysExamined: 968,
seeks: 1,
dupsTested: 0,
dupsDropped: 0
}
}
},
{
nReturned: 7,
executionTimeMillisEstimate: 0,
totalKeysExamined: 968,
totalDocsExamined: 7,
executionStages: {
stage: 'FETCH',
filter: { '$and': [ [Object], [Object] ] },
nReturned: 7,
executionTimeMillisEstimate: 0,
works: 968,
advanced: 7,
needTime: 961,
needYield: 0,
saveState: 0,
restoreState: 0,
isEOF: 0,
docsExamined: 7,
alreadyHasObj: 0,
inputStage: {
stage: 'AND_SORTED',
nReturned: 7,
executionTimeMillisEstimate: 0,
works: 968,
advanced: 7,
needTime: 961,
needYield: 0,
saveState: 0,
restoreState: 0,
isEOF: 0,
failedAnd_0: 232,
failedAnd_1: 230,
inputStages: [ [Object], [Object] ]
}
}
}
],
candidatePlanScores: [ 2.012596694214876, 1.0105305785123966, 1.0073314049586777 ],
indexFilterSet: false,
estimatedSizeBytes: Long('4826'),
solutionHash: Long('6151768200665613849'),
host: '40ae92e83a12:27017'
}
]
The cached plan uses only one index, on "a", but there are two additional possible plans in the cache: one using the index on "b" and another using a combination of both indexes with AND_SORTED. The scores (candidatePlanScores) are:
- 2.012 for the index on "a"
- 1.010 for the index on "b"
- 1.007 for the AND_SORTED intersection of the indexes on "a" and "b"
This may be surprising: given how the data was generated, we should expect similar costs for the two indexes. During the trial period, the plan using the index on "a" finished its scan (isEOF: 1), while the other two, which performed similarly and would also have finished, were cut off when the trial ended (isEOF: 0). MongoDB adds an EOF bonus of 1 when a trial plan finishes before the others, which explains the higher score. So it's not that the index on "a" is really better than the other plans: all plans are good, but the one that started and finished first got the bonus.
In addition to that, index intersection plans take another small penalty in the ranking. In the end, the scores break down roughly as:
- Index on "a": 1 (base) + 0.012 (productivity) + 1.0 (EOF bonus) = 2.012
- Index on "b": 1 (base) + 0.010 (productivity) + 0 (no EOF) = 1.010
- AND_SORTED: 1 (base) + 0.007 (productivity) + 0 (no EOF) = 1.007
Without the penalties, AND_SORTED would still not have been chosen. The problem is that the score measures productivity in units of work (advanced/works) but does not account for how much each work unit actually does: a single-index plan must fetch the document and apply the additional filter within one work unit, whereas AND_SORTED advances on index keys only and defers the fetch and filter until the intersection produces a match.
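To make the ranking concrete, here is a rough reconstruction of the three candidatePlanScores from the creationExecStats numbers above. The base score, productivity term, and EOF bonus are as described; the tiny tie-breaker bonuses, which the intersection plan partly misses, are my assumption about the ranker, so treat the exact values as approximate:
// Rough reconstruction of candidatePlanScores from the trial statistics above.
// Assumption: two tie-breaker bonuses of ~1e-4 for the single-index plans,
// only one for the intersection plan.
function score({ advanced, works, finished, isIntersection }) {
  const base = 1;
  const productivity = advanced / works;   // useful results per unit of work
  const eofBonus = finished ? 1 : 0;       // the plan reached EOF during the trial
  const tieBreakers = (isIntersection ? 1 : 2) * 1e-4;
  return base + productivity + eofBonus + tieBreakers;
}
print(score({ advanced: 12, works: 968, finished: true,  isIntersection: false })); // ~2.0126 (index on "a")
print(score({ advanced: 10, works: 968, finished: false, isIntersection: false })); // ~1.0105 (index on "b")
print(score({ advanced: 7,  works: 968, finished: false, isIntersection: true  })); // ~1.0073 (AND_SORTED)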
To show the AND_SORTED plan, I'll force it on my lab database for the following examples in this article:
// get the current parameters (default):
mongo> const allParams = db.adminCommand({ getParameter: "*" });
mongo> Object.keys(allParams).filter(k => k.toLowerCase().includes("intersection")).forEach(k => print(k + " : " + allParams[k]));
internalQueryForceIntersectionPlans : false
internalQueryPlannerEnableHashIntersection : false
internalQueryPlannerEnableIndexIntersection : true
// set all at true:
db.adminCommand({
setParameter: 1,
internalQueryPlannerEnableIndexIntersection: true,
internalQueryPlannerEnableHashIntersection: true,
internalQueryForceIntersectionPlans: true
});
I have set internalQueryForceIntersectionPlans to force index intersection (the query planner still runs, but intersection plans get a 3-point boost to their score). AND_SORTED intersection is possible by default, but I also enabled hash intersection (AND_HASH) for a later test that cannot use AND_SORTED.
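These are internal parameters, so once the experiments are done it is worth restoring the defaults listed above and clearing the cached plan; a minimal sketch:
// Restore the defaults shown earlier and drop the cached plans for the collection.
db.adminCommand({
  setParameter: 1,
  internalQueryPlannerEnableIndexIntersection: true,
  internalQueryPlannerEnableHashIntersection: false,
  internalQueryForceIntersectionPlans: false
});
db.demo.getPlanCache().clear();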
Index Intersection in MongoDB
Now that I forced index intersection, I can observe it with execution statistics:
db.demo.find({ a: 42, b: 42 }).explain("executionStats").executionStats
The execution plan shows that each index was scanned over a single range (seeks: 1) and combined with an AND_SORTED stage before fetching the documents for the result:
{
executionSuccess: true,
nReturned: 9,
executionTimeMillis: 4,
totalKeysExamined: 2009,
totalDocsExamined: 9,
executionStages: {
isCached: false,
stage: 'FETCH',
filter: {
'$and': [ { a: { '$eq': 42 } }, { b: { '$eq': 42 } } ]
},
nReturned: 9,
executionTimeMillisEstimate: 0,
works: 2010,
advanced: 9,
needTime: 2000,
needYield: 0,
saveState: 0,
restoreState: 0,
isEOF: 1,
docsExamined: 9,
alreadyHasObj: 0,
inputStage: {
stage: 'AND_SORTED',
nReturned: 9,
executionTimeMillisEstimate: 0,
works: 2010,
advanced: 9,
needTime: 2000,
needYield: 0,
saveState: 0,
restoreState: 0,
isEOF: 1,
failedAnd_0: 501,
failedAnd_1: 495,
inputStages: [
{
stage: 'IXSCAN',
nReturned: 964,
executionTimeMillisEstimate: 0,
works: 965,
advanced: 964,
needTime: 0,
needYield: 0,
saveState: 0,
restoreState: 0,
isEOF: 1,
keyPattern: { a: 1 },
indexName: 'a_1',
isMultiKey: false,
multiKeyPaths: { a: [] },
isUnique: false,
isSparse: false,
isPartial: false,
indexVersion: 2,
direction: 'forward',
indexBounds: { a: [ '[42, 42]' ] },
keysExamined: 964,
seeks: 1,
dupsTested: 0,
dupsDropped: 0
},
{
stage: 'IXSCAN',
nReturned: 1045,
executionTimeMillisEstimate: 0,
works: 1045,
advanced: 1045,
needTime: 0,
needYield: 0,
saveState: 0,
restoreState: 0,
isEOF: 0,
keyPattern: { b: 1 },
indexName: 'b_1',
isMultiKey: false,
multiKeyPaths: { b: [] },
isUnique: false,
isSparse: false,
isPartial: false,
indexVersion: 2,
direction: 'forward',
indexBounds: { b: [ '[42, 42]' ] },
keysExamined: 1045,
seeks: 1,
dupsTested: 0,
dupsDropped: 0
}
]
}
}
}
by Franck Pachot
AWS Database Blog - Amazon Aurora
Netflix operates a global streaming service that serves hundreds of millions of users through a distributed microservices architecture. In this post, we examine the technical and operational challenges encountered by their Online Data Stores (ODS) team with their current self-managed distributed PostgreSQL-compatible database, the evaluation criteria used to select a database solution, and why they chose to migrate to Amazon Aurora PostgreSQL to meet their current and future performance needs. The migration to Aurora PostgreSQL improved their database infrastructure, achieving up to 75% increase in performance and 28% cost savings across critical applications.
by Jordan West, Ammar Khaku
November 26, 2025
AWS Database Blog - Amazon Aurora
With the Letta Developer Platform, you can create stateful agents with built-in context management (compaction, context rewriting, and context offloading) and persistence. Using the Letta API, you can create agents that are long-lived or achieve complex tasks without worrying about context overflow or model lock-in. In this post, we guide you through setting up Amazon Aurora Serverless as a database repository for storing Letta long-term memory. We show how to create an Aurora cluster in the cloud, configure Letta to connect to it, and deploy agents that persist their memory to Aurora. We also explore how to query the database directly to view agent state.
by Sarah Wooders
Percona Database Performance Blog
Where We Are We can all agree that the MySQL ecosystem isn’t in great shape right now. Take a look at Julia’s blog post [Analyzing the Heartbeat of the MySQL Server: A Look at Repository Statistics], which confirms what many of us have felt: Oracle isn’t as committed to MySQL and its ecosystem as it […]
by Vadim Tkachenko
ParadeDB Blog
Introducing search aggregation, V2 API as default, and performance improvements that eliminate the complexity between search and analytics in a single Postgres-native system.
by Philippe Noël
November 25, 2025
AWS Database Blog - Amazon Aurora
In this post, I explore how Amazon Aurora DSQL uses Amazon Time Sync Service to build a hybrid logical clock solution.
by Katja-Maja Kroedel
AWS Database Blog - Amazon Aurora
Amazon Aurora DSQL employs an active-active distributed database design, wherein all database resources are peers and serve both write and read traffic within a Region and across Regions. This design facilitates synchronous data replication and automated zero data loss failover for single and multi-Region Aurora DSQL clusters. In this post, I discuss the individual components and the responsibilities of a multi-Region distributed database to provide an ACID-compliant, strongly consistent relational database.
by Katja-Maja Kroedel
AWS Database Blog - Amazon Aurora
In this third post of the series, I examine the end-to-end processing of the two transaction types in Aurora DSQL: read-only and read-write. Amazon Aurora DSQL doesn’t have write-only transactions, since it’s imperative to verify the table schema or ensure the uniqueness of primary keys on each change – which results in them being read-write transactions as well.
by Katja-Maja Kroedel
AWS Database Blog - Amazon Aurora
In this second post, I examine Aurora DSQL's architecture and explain how its design decisions impact functionality—such as optimistic locking and PostgreSQL feature support—so you can assess compatibility with your applications. I provide a comprehensive overview of the underlying architecture, which is fully abstracted from the user.
by Katja-Maja Kroedel
AWS Database Blog - Amazon Aurora
In this post, I dive deep into fundamental concepts that are important to comprehend the benefits of Aurora DSQL, its feature set, and its underlying components.
by Katja-Maja Kroedel
Percona Database Performance Blog
At Percona, our mission has always been to help you succeed with open source databases. We do that by listening to the community, understanding the challenges you face, and building the solutions you need. Now, after a comprehensive review of market trends and direct feedback from our customers and the MySQL community, we are excited […]
by Dennis Kittrell
Murat Demirbas
The Atropos paper (SOSP'25) argues that overload-control systems are built on a flawed assumption. They monitor global signals (like queue length or tail latency) to adjust admission control (throttling new arrivals or dropping random requests). This works when the bottleneck is CPU or network, but it fails when the real problem is inside the application. This considers only the symptoms but not the source. As a result, it drops the victims rather than the culprits.
Real systems often run into overload because one or two unluckily timed requests monopolize an internal logical resource (like buffer pools, locks, and thread-pool queues). These few rogue whales have nonlinear effects. A single ill-timed dump query can thrash the buffer pool and cut throughput in half. A single backup thread combined with a heavy table scan can stall writes in MySQL, as seen in Figure 3. The CPU metrics will not show this.
Atropos proposes a simple fix to this problem. Rather than throttling or dropping victims at the front of the system, it continuously monitors how tasks use internal logical resources and cancels the ones most responsible for the collapse. The name comes from the three Greek Fates. Clotho spins the thread of life, Lachesis measures its length, and Atropos cuts it when its time is up. The system plays the same role: it cuts the harmful task to protect the others.
The first interesting point in the paper is that their survey of 151 real applications (databases, search engines, and distributed systems) shows that almost all of them already contain safe cancellation hooks. Developers have already built the ability to stop a task cleanly. So the concerns about cancellation being too dangerous or too invasive turn out to be outdated. The support is already there; what's missing is a systematic way to decide which tasks to cancel in the first place.
To identify which tasks are the rogue whales, Atropos introduces a lightweight way to track logical resource usage. It instruments three operations: acquiring a resource, releasing it, and waiting on it. These are cheap counters and timestamps. For memory this wraps code that gets or frees buffer pages. For locks it wraps lock acquisition and lock waits. For queues it wraps enqueue and dequeue. The runtime can now trace who is touching which resource and who is blocking whom.
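As a rough illustration (not the paper's code, and every name here is hypothetical), such hooks amount to a few counters and timestamps kept per task and resource:
// Hypothetical sketch: per-(task, resource) counters and timers around the three
// instrumented operations: acquire, release, and wait.
class ResourceTracker {
  constructor() { this.stats = new Map(); }
  _slot(task, resource) {
    const key = task + ":" + resource;
    if (!this.stats.has(key)) this.stats.set(key, { acquired: 0, held: 0, waitMs: 0, waitStart: null });
    return this.stats.get(key);
  }
  acquire(task, resource) { const s = this._slot(task, resource); s.acquired++; s.held++; }
  release(task, resource) { this._slot(task, resource).held--; }
  beginWait(task, resource) { this._slot(task, resource).waitStart = Date.now(); }
  endWait(task, resource) {
    const s = this._slot(task, resource);
    if (s.waitStart !== null) { s.waitMs += Date.now() - s.waitStart; s.waitStart = null; }
  }
}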
The map above (from Nano Banana Pro) lays out the Atropos design. Let's walk through its main pieces.
The runtime borrows the "overload detection" from prior work (Breakwater). When latency rises above the SLO but throughput stays flat, overload is present. At that moment Atropos inspects the resource traces and identifies hot spots. It computes two measures per resource per task.
- The first measure is the contention level. This is a measure of how much time tasks spend waiting due to that resource. High eviction ratios, long lock wait ratios, and long queue wait times all point to a rogue whale task.
- The second measure is the resource gain. This estimates how much future load would be relieved if a task were canceled. This is the clever idea. The naive approach counts past usage, but a task that has consumed many pages but is almost finished does not pose much future harm. A task that has barely begun a huge scan is the real whale to watch, because its future thrashing potential is far larger than what its small past footprint suggests. The system uses progress estimates to scale current usage by remaining work. Databases usually track rows read vs rows expected. Other systems provide analogous application-specific counters for an estimate of future demand. This allows the controller to avoid killing nearly finished tasks.
The policy engine considers all resources together. Some tasks stress memory, others stress locks, and others queues. A single-resource policy would make narrow decisions. Instead Atropos identifies the non-dominated set of tasks across all resources and computes a weighted score, where the weights are the contention levels of the resources. The resultant score is the expected gain from canceling that task. The task with the highest score is canceled using the application's own cancellation mechanism.
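A rough sketch of that policy as I read it (the data shapes and scaling are mine, and the non-dominated-set filtering is omitted for brevity):
// Hypothetical sketch: weight each task's estimated future resource usage by the
// contention level of that resource, then pick the highest-scoring task to cancel.
function pickTaskToCancel(tasks, contention) {
  // tasks: [{ id, usage: { bufferPool: pagesHeld, locks: locksHeld, ... }, progress: 0..1 }]
  // contention: { bufferPool: 0.9, locks: 0.2, ... }, e.g. the share of time spent waiting per resource
  let best = null;
  for (const t of tasks) {
    let score = 0;
    for (const [resource, used] of Object.entries(t.usage)) {
      const futureGain = used * (1 - t.progress);   // expected load relieved if the task is cancelled
      score += (contention[resource] ?? 0) * futureGain;
    }
    if (best === null || score > best.score) best = { id: t.id, score };
  }
  return best; // the caller then invokes the application's own cancellation hook
}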
The evaluation of the paper is strong. This is a prototypical SOSP paper. The authors reproduce sixteen real overload bugs in systems like MySQL, Postgres, Elasticsearch, Solr, Apache, and etcd. These include buffer pool thrashing, lock convoys, queue stalls, and indexing contention. Across these cases Atropos restores throughput close to the baseline. The median result is around ninety six percent of normal throughput under overload. Tail latency stays near normal. The cancellation rate is tiny: less than one in ten thousand requests! Competing approaches must drop far more requests and still fail to restore throughput. The key here is the nonlinear effect of freeing internal logical resources. Canceling the right task unblocks a crowd of others.
As usual, Aleksey and I did our live blind read of it, which you can watch below. My annotated copy of the paper is also available here.
What I like about the paper is its clarity. The motivating examples are strong and concrete. The design is small, understandable, and modular. The progress-based future estimation is a good idea. The policy avoids naive heuristics. This is not a general overload controller for all situations. It does not manage CPU saturation or network overload. Instead it solves overload caused by contention on internal logical resources by killing the rogue whale, and quickly restoring normal behavior.
by Murat (noreply@blogger.com)
CedarDB Blog
CedarDB Tames the Slopes (of the graph)
It would be natural to ask about the connection between skiing, or taking to
the slopes, and database systems. That connection arose out of the way I related to
the graph, below, that I created for a recent POC, showing the relationship between the time to run 750 queries
against a table, for various row counts, in PostgreSQL, TimescaleDB, and CedarDB.
In skiing terminology, the blue line might be a black diamond run, the green
one a beginner run, and the grey line along the x-axis is almost in an entirely different category and more fit for cross-country skiing.
What blew my mind was that the slope of the CedarDB line was essentially zero.
The motivation for these database comparisons is a SQL query workload to
support an app built by a biotech startup. The app interacts with their
hardware device, which collects telemetry based on a user’s activities and,
among other things, provides access to a library of content that the user might
enjoy. The team at this startup provided us with some background on their
database challenges, and that included lists such as top 20 slowest queries
and top 20 heaviest queries. After analyzing these, we identified a query
that appeared high on both lists, one that informs their recommendation
pipeline, so we started there:
SELECT count(*) AS count_1
FROM (
SELECT track_plays.time AS track_plays_time
, track_plays.end_time AS track_plays_end_time
, track_plays.user_id AS track_plays_user_id
, track_plays.session_id AS track_plays_session_id
, track_plays.track_id AS track_plays_track_id
, track_plays.duration_sec AS track_plays_duration_sec
, track_plays.source AS track_plays_source
FROM track_plays
WHERE
track_plays.time >= $1
AND track_plays.end_time <= $2
AND track_plays.user_id = $3
AND track_plays.duration_sec >= $4
) AS anon_1;
Looking at it, it’s not complex: just a table scan with a filter on a couple of TIMESTAMP and INTEGER columns.
How Timescale Makes the Slope Beginner Friendly
Credit where credit is due, let’s first take a look at how TimescaleDB managed to flatten the slope considerably for our POC customer.
TimescaleDB is a PostgreSQL extension designed to optimize the processing of time-series data, which is exactly what the track_plays table tracks.
To really see the effect above with TimescaleDB, you must explicitly set it up using Hypertables,
which partition the data by the time column using hourly partitions (chunk_interval). This, however, requires advance knowledge of your data and query patterns.
...
WITH
(
timescaledb.hypertable,
timescaledb.partition_column='time',
timescaledb.chunk_interval='1 hour'
);
TimescaleDB can then use the partitions at query time, selecting only those relevant to the query, which drastically speeds up many time-series workloads.
While it’s reasonable to assume that you often know this kind of information beforehand, TimescaleDB’s Hypertables are unfortunately not without tradeoffs:
- Foreign key constraints are not supported between hypertables.
- UPDATE statements that would move a row between chunks are not allowed.
- All columns used for partitioning dimensions must be included in a unique index.
- The time column used for partitioning cannot contain NULL values.
While these tradeoffs are acceptable for many workloads, the question is: Do you need to make them?
So, Where’s the Secret Sauce with CedarDB?
Having seen how a use-case specific database system can make workloads beginner friendly,
let’s take a look at how CedarDB manages to play a different sport entirely.
How do I account for the vastly different performance results I observed for this workload?
After all, TimescaleDB is specifically built to tackle just this type of time
series data challenge.
While the performance increase is a mix of a lot of innovations,
I want to highlight the one aspect I think best maps to what Hypertables do in TimescaleDB, the concept of “early pruning”.
When scanning a table, CedarDB manages to check many predicates against metadata only, skipping entirely those blocks that don’t qualify.
In this instance, since a timestamp is just a 64-bit integer, timestamp comparisons are plain integer comparisons internally, and
CedarDB can skip most data blocks when filtering for a time range, as the heavy-hitter query does.
How is it able to achieve this? Here are some key aspects contributing to that:
- CedarDB stores large chunks of data together into compressed blocks.
- CedarDB stores statistics for each block (e.g., max value, min value, count).
- If there is no value in a given data block that qualifies, per the query predicate, CedarDB avoids scanning that block.
- Time-series data is often inserted in roughly timestamp order. Even if the order is not perfect, timestamps that are close together generally end up in the same block.
- Filters are especially useful for pruning in a time series data workload, where it’s common to filter by a timestamp column (e.g., all values of the last month), as illustrated here.
Sound familiar? That is exactly the effect Hypertables have in TimescaleDB, just on all columns instead of one and without needing to specify anything beforehand.
And the beauty of it is that you get all this without sacrificing flexibility in other areas.
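To make the pruning idea concrete, here is a minimal, hypothetical sketch of min/max block pruning in general (this is the generic technique, not CedarDB's actual implementation):
// Hypothetical sketch: each compressed block carries per-column statistics, and
// blocks whose [min, max] range cannot match the predicate are skipped without
// being decompressed or scanned.
function scanTimeRange(blocks, fromTs, toTs) {
  const matches = [];
  for (const block of blocks) {
    const { min, max } = block.stats.time;       // per-block statistics on the time column
    if (max < fromTs || min > toTs) continue;    // whole block outside the range: skip it
    for (const row of block.rows) {              // only qualifying blocks are actually scanned
      if (row.time >= fromTs && row.time <= toTs) matches.push(row);
    }
  }
  return matches;
}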
Some Takeaways
What else did I learn from this exercise, besides that I need to get ready for skiing season? Here’s what I’ve jotted down:
- The observations represented by the graphs up above inspired me to continue with more data, so I moved the test jig off my M4 MacBook and onto an r7i.8xlarge instance in EC2. There, I gradually increased the row count in this table (via a data generator), from 25 M rows, where I recorded a cumulative runtime (750 queries) of 79 ms, to 1.13 billion rows, where the runtime was 96 ms. That 1.13B rows is 180x the size of the data point represented by the right-most point in the graph, yet the query runtime in CedarDB remained essentially constant.
- … So, while CedarDB is amazing on the 8 vCPU & 32 GB RAM of my MacBook Docker setup, when it has access to the 32 vCPU & 128 GB of RAM of the larger EC2 instance, it takes full advantage of all that CPU and memory.
- The efficiency here takes on two forms: (1) you pay less for IaaS or the data center, and (2) you simplify things through leaner DDL, fewer indexes, and the ability to consolidate multiple workloads onto a single, general-purpose database.
- Finally, this radical performance boost may embolden you to consider some new use cases you may have been hesitant to explore using your existing database engine.
Note: If the folks we’re collaborating with on this give us the “OK”, we’ll share the code used here and update this with links.
Call to action: If you’re curious and want to give CedarDB a try in AWS, the CloudFormation
template
that I used to automate setup of my CedarDB in EC2 can help you get started pretty easily.
Thank you!
If you’d like to discuss your data challenges with us, we’d love to hear from you.
Appendix: The Nuts and Bolts
Benchmark Setup
To investigate how this workload would perform in CedarDB, we needed a few things:
- Data generator: Pair programming alongside ChatGPT knocked this out quickly, in Python. It was designed based on the DDL for two tables which were key to a different query, along with the queries themselves, to ensure our joins resolved properly. Ultimately, we relied only on the data for the single table,
track_plays (DDL below).
- Load simulator: Again, ChatGPT came in handy and rendered a mostly-working version of a multi-threaded Go/pgx client.
- A couple of shell scripts for running tests and loading data.
- Our load parameters: We opted to simulate 25 runs across 30 threads, yielding 750 total queries.
- Docker containers for CedarDB, TimescaleDB, and PostgreSQL
Here’s the DDL for track_plays, in TimescaleDB. A couple of things to note:
- The table uses TimescaleDB’s hypertable, which is described here,
to partition the data by the
time column, using hourly partitions (our generated data covered a single day), allowing queries
to target only the relevant chunks, drastically improving performance compared to a standard, unpartitioned table, especially for large time-series datasets.
- The DDL provided to us included these four indexes, so we included these for Postgres and TimescaleDB.
CREATE TABLE public.track_plays
(
"time" timestamp with time zone NOT NULL,
user_id text NOT NULL,
session_id text NOT NULL,
track_id text NOT NULL,
duration_sec integer NOT NULL,
source TEXT DEFAULT 'default_source'::TEXT NOT NULL,
end_time timestamp with time zone
)
WITH
(
timescaledb.hypertable,
timescaledb.partition_column='time',
timescaledb.chunk_interval='1 hour'
);
CREATE INDEX ON public.track_plays USING btree (end_time);
CREATE INDEX ON public.track_plays USING btree (session_id, "time" DESC);
CREATE INDEX ON public.track_plays USING btree ("time", end_time, duration_sec);
CREATE INDEX ON public.track_plays USING btree ("time", user_id, end_time, duration_sec);
In PostgreSQL, our DDL was similar, but didn’t contain the WITH ... portion defining the hypertable.
Finally, in CedarDB, the DDL from Postgres reduces to just this (note the absence of indexes):
CREATE TABLE public.track_plays
(
"time" timestamp with time zone NOT NULL,
user_id text NOT NULL,
session_id text NOT NULL,
track_id text NOT NULL,
duration_sec integer NOT NULL,
source TEXT DEFAULT 'default_source'::TEXT NOT NULL,
end_time timestamp with time zone
);
This is a sample row of generated data:
postgres=# select * from track_plays order by random() limit 1;
time | user_id | session_id | track_id | duration_sec | source | end_time
------------------------+--------------------+------------------+-------------+--------------+--------+------------------------
2025-11-01 11:40:35+00 | fransisco.jolliffe | 6d473b55db1c71cb | track_85767 | 395 | music | 2025-11-01 11:47:10+00
(1 row)
The Bash script that drove the test runs:
#!/bin/bash
# This will print -runs X -threads lines of output; e.g. for 25 runs and 30 threads, there will be 750 lines of output
if [ -z ${DB_URL+x} ]
then
echo " DB_URL must be set"
echo ""
echo " Example: export DB_URL=\"postgresql://postgres:postgres@localhost:5432/postgres?sslmode=require\""
echo ""
exit 1
fi
# Build Go app if the EXE isn't present here
[ -x ./go-pgx-tester ] || { go mod tidy && go build . ; }
# Obtain initial and final time bounds from generated data
t0=$( psql "$DB_URL" -c "\\timing off" -tAc "select min("time") from track_plays;" | tail -1 | perl -ne 's/ /T/; s/\+00$/Z/; print;' )
t1=$( psql "$DB_URL" -c "\\timing off" -tAc "select max("time") from track_plays;" | tail -1 | perl -ne 's/ /T/; s/\+00$/Z/; print;' )
# List of users
users=$( psql "$DB_URL" -c "\\timing off" -tAc "with a as (select distinct(user_id) from track_plays) select string_agg(user_id, ',') from a;" | tail -1 )
./go-pgx-tester \
-db "$DB_URL" \
-start $t0 \
-end $t1 \
-max-dur 450 \
-min-dur 30 \
-runs 25 \
-sleep 100us \
-threads 30 \
-users $users
# Print which DB ran the test
psql "$DB_URL" -c "\\timing off" -tAc "select version();" | tail -1
# Print number of rows in the table
n_rows=$( psql "$DB_URL" -c "\\timing off" -tAc "select count(*) from track_plays;" | tail -1 )
echo "Row count: $n_rows"
Here’s the progression of data generation used to create the graphs:
$ ./data_gen_track_plays_sleep_states_duration.py --users 100 --sessions-per-user 15 --sleep-cadence-sec 5 --minutes-per-session 720 --start '2025-11-01T00:00:00Z' --sleep-csv sleep_states.csv --tracks-csv track_plays.csv --tracks-only
$ ./data_gen_track_plays_sleep_states_duration.py --users 100 --sessions-per-user 30 --sleep-cadence-sec 5 --minutes-per-session 720 --start '2025-11-01T00:00:00Z' --sleep-csv sleep_states.csv --tracks-csv track_plays.csv --tracks-only
$ ./data_gen_track_plays_sleep_states_duration.py --users 100 --sessions-per-user 60 --sleep-cadence-sec 5 --minutes-per-session 720 --start '2025-11-01T00:00:00Z' --sleep-csv sleep_states.csv --tracks-csv track_plays.csv --tracks-only
$ ./data_gen_track_plays_sleep_states_duration.py --users 140 --sessions-per-user 75 --sleep-cadence-sec 5 --minutes-per-session 720 --start '2025-11-01T00:00:00Z' --sleep-csv sleep_states.csv --tracks-csv track_plays.csv --tracks-only
$ ./data_gen_track_plays_sleep_states_duration.py --users 281 --sessions-per-user 75 --sleep-cadence-sec 5 --minutes-per-session 720 --start '2025-11-01T00:00:00Z' --sleep-csv sleep_states.csv --tracks-csv track_plays.csv --tracks-only
$ ./data_gen_track_plays_sleep_states_duration.py --users 570 --sessions-per-user 75 --sleep-cadence-sec 5 --minutes-per-session 720 --start '2025-11-01T00:00:00Z' --sleep-csv sleep_states.csv --tracks-csv track_plays.csv --tracks-only
Data loading, TimescaleDB:
$ psql "postgresql://postgres:postgres@localhost:15432/postgres?sslmode=disable" -c "COPY track_plays FROM stdin DELIMITER ',' NULL '\\N' ESCAPE '\' HEADER CSV" < track_plays.csv
Timing is on.
COPY 6220985
Time: 54162.184 ms (00:54.162)
PostgreSQL:
$ psql "postgresql://postgres:postgres@localhost:15432/postgres?sslmode=disable" -c "COPY track_plays FROM stdin DELIMITER ',' NULL '\\N' ESCAPE '\' HEADER CSV" < track_plays.csv
Timing is on.
COPY 6220985
Time: 60835.740 ms (01:00.836)
And CedarDB:
$ psql "postgresql://postgres:postgres@localhost:5432/postgres?sslmode=require" -c "COPY track_plays FROM stdin DELIMITER ',' NULL '\\N' ESCAPE '\' HEADER CSV" < track_plays.csv
Timing is on.
COPY 6220985
Time: 10038.192 ms (00:10.038)
What I noticed: CedarDB’s load time is less than a fifth of the time taken by either TimescaleDB or Postgres (about 10 s versus 54 s and 61 s),
which isn’t surprising given that CedarDB doesn’t require any indexes for this workload, so it avoids the costs
associated with maintaining them.
For each of the three DBs, and for each of the data sizes, we measured four runs; the average
of these values appears in the data table below. Here is an example test run:
What the Test Runs Looked Like:
$ ./run_test.sh
Starting workload: threads=30 runs/thread=25 time=[2025-11-01T00:00:00Z .. 2025-11-02T11:59:55Z] dur=[30..450] users=460 timeout=15s
2025/11/14 10:58:07 [w15] ok run=0 uid=milana.katsma time=[2025-11-01T06:51:43Z..2025-11-01T07:17:03Z] dur=[301] count=131 latency=4.547542ms
2025/11/14 10:58:07 [w22] ok run=0 uid=robert.gottreich time=[2025-11-02T07:25:17Z..2025-11-02T08:10:53Z] dur=[38] count=609 latency=4.853083ms
...
2025/11/14 10:58:14 [w06] ok run=24 uid=con.gustafson time=[2025-11-01T08:24:13Z..2025-11-01T08:50:51Z] dur=[274] count=358 latency=614.541µs
2025/11/14 10:58:14 [w15] ok run=24 uid=matute.gluck time=[2025-11-02T02:22:44Z..2025-11-02T02:44:23Z] dur=[59] count=507 latency=580.792µs
Done in 130.279042ms
PostgreSQL 16.3 compatible CedarDB v2025-11-06
Row count: 6220985
And the resulting data, from which the graphs were created:
| N_Rows | Postgres (ms) | TimescaleDB (ms) | CedarDB (ms) |
|---|---|---|---|
| 218,380 | 248 | 128 | 127 |
| 436,700 | 523 | 183 | 130 |
| 873,437 | 853 | 301 | 133 |
| 1,528,447 | 1267 | 488 | 136 |
| 3,067,366 | 6932 | 1312 | 127 |
| 6,220,985 | 15180 | 2847 | 132 |