February 13, 2025
How GaadiBazaar reduced database costs by 40% with Aurora MySQL Serverless
Ship data as you ship code: Deploy changes with a single command
Self-Hosting with Convex: Everything You Need to Know
February 11, 2025
MongoDB Equality, Sort, Range (ESR) without Equality (SR): add an unbounded range predicate on the indexed sort field
In a previous post, I applied the ESR rule to multiple databases. I used a simplified example with no equality predicate to focus on covering the Sort and Range with the index. However, for the MongoDB demo, I added an Equality predicate that doesn't filter anything, {$gt:MinKey}, to get a full Equality, Sort, Range.
On my first day as a Developer Advocate for MongoDB, I discussed this with the product managers; the reason and a possible optimization are tracked in the improvement idea "Tighten index bounds and allow compound index to be chosen when predicate on leading field is not provided".
In this article, I demonstrate the issue, along with a simple workaround.
A table to experiment with the ESR rule
Here is a collection with fields that I'll use for queries with Equality, Sort, Range:
mdb> db.demoesr.drop();
mdb> db.demoesr.insertMany([
{ e:42, s:"a", r:10 },
{ e:42, s:"b", r:20 },
{ e:42, s:"b", r:10 },
{ e:42, s:"d", r:30 },
{ e:42, r:40 } // add one doc with no sort value
]);
Index and query on Equality, Sort, Range (ESR)
I create a compound index on fields "e", "s", and "r", in the order recommended by the ESR rule:
mdb> // index on Equality, Sort, Range
mdb> db.demoesr.createIndex(
{ e: 1 , s: 1, r : 1 }
);
e_1_s_1_r_1
I execute a query with an equality predicate on "e", a range predicate on "r", and a sort on "s", along with a projection of only the indexed fields to ensure the query is fully covered by the index (the "_id" field must be explicitly excluded from the projection with "_id":0):
mdb> db.demoesr.find(
{ e:{$eq: 42} ,r:{$gt:10} } , {e:1,s:1,r:1,"_id":0 }
).sort({s:1});
[
{ e: 42, s: null, r: 40 },
{ e: 42, s: 'b', r: 20 },
{ e: 42, s: 'd', r: 30 }
]
I know that the result comes directly from the index entries, without fetching the documents, because I can see s: null for the document where "s" doesn't exist. An index entry has a value for every key field, using null when the field is missing from the document.
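As a quick cross-check I added for illustration (not part of the original demo), the stored document really has no "s" field at all, so the null above can only come from the index entry:
mdb> // the stored document itself has no "s" field
mdb> db.demoesr.find(
 { e: 42, r: 40 } , { "_id": 0 }
);
[
 { e: 42, r: 40 }
]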
I confirm from the execution plan that there's no FETCH and no documents were read:
mdb> db.demoesr.find(
{ e:{$eq: 42} ,r:{$gt:10} } , {e:1,s:1,r:1,"_id":0 }
).sort({s:1}).explain("executionStats").executionStats;
{
nReturned: 3,
totalKeysExamined: 5,
totalDocsExamined: 0,
executionStages: {
stage: 'PROJECTION_COVERED',
nReturned: 3,
transformBy: { e: 1, s: 1, r: 1, _id: 0 },
inputStage: {
stage: 'IXSCAN',
nReturned: 3,
keyPattern: { e: 1, s: 1, r: 1 },
indexName: 'e_1_s_1_r_1',
direction: 'forward',
indexBounds: {
e: [ '[42, 42]' ],
s: [ '[MinKey, MaxKey]' ],
r: [ '(10, inf.0]' ]
},
keysExamined: 5,
seeks: 3,
}
}
This is the perfect execution plan for such queries, as Equality, Sort, and Range were all pushed down to the IXSCAN:
- e: [ '[42, 42]' ] is first in the index bounds and returns a single range for the filter on "e". The range holds keysExamined: 5 index entries.
- s: [ '[MinKey, MaxKey]' ] is second in the index bounds and returns this range ordered by "s", still returning keysExamined: 5 index entries.
- r: [ '(10, inf.0]' ] cannot use a single seek to get this range because there are multiple values in the preceding "s", but the filter is applied to the index entries and returns nReturned: 3 index entries.
Additionally, because all fields in the projection are in the index entries, no documents had to be fetched: totalDocsExamined: 0
This is a perfect example of the ESR rule with a covering index where avoiding a sort is the goal.
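To appreciate what the index order saves, here is a counter-check I added (not in the original demo): hinting a collection scan forces the same query to sort in memory, so the plan should show a SORT stage over a COLLSCAN instead of the covered IXSCAN (the exact output may vary by MongoDB version):
mdb> // force a collection scan: the sort can no longer rely on the index order
mdb> db.demoesr.find(
 { e:{$eq: 42} ,r:{$gt:10} } , {e:1,s:1,r:1,"_id":0 }
).sort({s:1}).hint({$natural: 1}).explain("executionStats").executionStats;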
In my example, there is only one value for "e". Let's remove the equality condition.
Index and query on Sort, Range (SR)
I want to run the same query without an equality predicate. To avoid a sort operation, the index starts with the "s" column:
mdb> // index on Sort, Range
mdb> db.demoesr.createIndex(
{ s: 1, r : 1 }
);
s_1_r_1
Here is a query with only a Range and Sort:
mdb> db.demoesr.find(
{ r:{$gt:10} } , {e:1,s:1,r:1,"_id":0 }
).sort({s:1});
[
{ e: 42, r: 40 },
{ e: 42, s: 'b', r: 20 },
{ e: 42, s: 'd', r: 30 }
]
The non-existent "s" is absent in the result, instead of being shown as null, which indicates that what is displayed was retrieved from the document rather than the index entry.
I verify this with the execution plan:
mdb> db.demoesr.find(
{ r:{$gt:10} } , {e:1,s:1,r:1,"_id":0 }
).sort({s:1}).explain("executionStats").executionStats;
{
nReturned: 3,
totalKeysExamined: 5,
totalDocsExamined: 5,
executionStages: {
stage: 'PROJECTION_SIMPLE',
nReturned: 3,
transformBy: { e: 1, s: 1, r: 1, _id: 0 },
inputStage: {
stage: 'FETCH',
filter: { r: { '$gt': 10 } },
nReturned: 3,
docsExamined: 5,
inputStage: {
stage: 'IXSCAN',
nReturned: 5,
keyPattern: { s: 1, r: 1 },
indexName: 's_1_r_1',
direction: 'forward',
indexBounds: { s: [ '[MinKey, MaxKey]' ], r: [ '[MinKey, MaxKey]' ] },
keysExamined: 5,
seeks: 1,
This is not good because no filtering has been applied in the IXSCAN (r: [ '[MinKey, MaxKey]' ]): docsExamined: 5 documents have been fetched, more than what we need for the result: nReturned: 3.
Adding an unbounded range on the sort field
The workaround is to get back to an ESR query with a predicate on the first field of the index, the one that covers the sort. I can't use an equality predicate because "s" may have multiple values, so I use MinKey (or MaxKey) to get a full range on this field:
mdb> db.demoesr.find(
{ s:{$gte:MinKey}, r:{$gt:10} } , {e:1,s:1,r:1,"_id":0 }
).sort({s:1});
[
{ e: 42, r: 40 },
{ e: 42, s: 'b', r: 20 },
{ e: 42, s: 'd', r: 30 }
]
Following my earlier observation that a fully covered query shows s: null coming from the index entry, here the documents were retrieved from the collection. This is necessary because "e" is not in the index used by this query, so the document must be fetched and examined to obtain its value for the projection.
I display the execution plan to grasp the benefit of adding a dummy filter on the sort field, when it is the first field of the index:
mdb> db.demoesr.find(
{ s:{$gte:MinKey}, r:{$gt:10} } , {e:1,s:1,r:1,"_id":0 }
).sort({s:1}).explain("executionStats").executionStats;
{
nReturned: 3,
executionTimeMillis: 0,
totalKeysExamined: 5,
totalDocsExamined: 3,
executionStages: {
stage: 'PROJECTION_SIMPLE',
nReturned: 3,
transformBy: { e: 1, s: 1, r: 1, _id: 0 },
inputStage: {
stage: 'FETCH',
nReturned: 3,
docsExamined: 3,
inputStage: {
stage: 'IXSCAN',
nReturned: 3,
keyPattern: { s: 1, r: 1 },
indexName: 's_1_r_1',
direction: 'forward',
indexBounds: { s: [ '[MinKey, MaxKey]' ], r: [ '(10, inf.0]' ] },
keysExamined: 5,
seeks: 3,
The index bound on the first column used for Sort is the same: s: [ '[MinKey, MaxKey]' ], which confirms that adding s:{$gte:MinKey} didn't change the number of index entries (keysExamined: 5). What has changed is that the range predicate is now pushed down to the IXSCAN as r: [ '(10, inf.0]' ], and only docsExamined: 3 documents have been fetched to get the nReturned: 3 documents for the result.
Even though I added a range predicate using MinKey or MaxKey, it is equivalent to the Equality in the ESR rule because what matters is that it resolves to a single range, ensuring that the entries are sorted on the following field of the compound index.
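As a side note I added beyond the original post, the same trick should also work for a descending sort: the predicate still resolves to a single range on "s", and the index can be scanned in the backward direction, so explain() should report direction: 'backward' without a SORT stage:
mdb> // descending sort with the same unbounded predicate on the sort field
mdb> db.demoesr.find(
 { s:{$gte:MinKey}, r:{$gt:10} } , {e:1,s:1,r:1,"_id":0 }
).sort({s:-1});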
In summary, the ESR (Equality, Sort, Range) Rule outlined by MongoDB is an excellent framework for designing your composite indexes. However, it's essential to review the execution plan to ensure you meet your objective: avoiding the retrieval of too many documents.
Vector indexes, MariaDB & pgvector, large server, dbpedia-openai dataset
This post has results from ann-benchmarks to compare MariaDB and Postgres with a larger dataset, dbpedia-openai at 100k, 500k and 1M rows. It has 1536 dimensions and uses angular (cosine) as the distance metric. By larger I mean by the standards of what is in ann-benchmarks. This work was done by Small Datum LLC and sponsored by the MariaDB Corporation.
tl;dr
- Index create time was much less for MariaDB in all cases except the result for recall >= 0.95
- For a given recall, MariaDB gets between 2.1X and 2.7X more QPS than Postgres
This post has much more detail about my approach in general. I ran the benchmark for 1 session. I use ann-benchmarks via my fork of a fork of a fork at this commit. The ann-benchmarks config files are here for MariaDB and for Postgres.
The command lines to run the benchmark using my helper scripts are:
bash rall.batch.sh v1 dbpedia-openai-100k-angular c32r128
These charts show the best QPS for a given recall. MariaDB gets about 2X more QPS than Postgres at a given recall level.
With 100k rows
With 500k rows
With 1M rows
Results: create index
- index sizes are similar between MariaDB and pgvector with halfvec
- time to create the index varies a lot and it is better to consider this in the context of recall, which is done in the next section
- M - value for M when creating the index
- cons - value for ef_construction when creating the index
- secs - time in seconds to create the index
- size(MB) - index size in MB
With ann-benchmarks the constraint is recall. Below I share the best QPS for a given recall target along with the configuration parameters (M, ef_construction, ef_search) at which that occurs for each of the algorithms (MariaDB, pgvector with float32, pgvector with float16/halfvec).
- Postgres does not get recall=1.0 for the values of M, ef_construction and ef_search I used
- Index create time was much less for MariaDB in all cases except the result for recall >= 0.95
- For a given recall target, MariaDB gets between 2.1X and 2.7X more QPS than Postgres
- recall, QPS - best QPS at that recall
- isecs - time to create the index in seconds
- m= - value for M when creating the index
- ef_cons= - value for ef_construction when creating the index
- ef_search= - value for ef_search when running queries
February 10, 2025
Simplify database authentication management with the Amazon Aurora PostgreSQL pg_ad_mapping extension
February 09, 2025
Aurora DSQL is different than Aurora, but Aurora DSQL belongs to Aurora (which belongs to RDS)
The term "Aurora DSQL" can be somewhat confusing, as it shares its name with another RDS database called "Aurora". This lack of distinction complicates discussions about the two. How would you refer to the original Aurora when it is neither "serverless" nor "limitless"?
The "non-DSQL" Aurora is PostgreSQL-compatible (APG -Aurora PostgreSQL) but is not a distributed database. It operates with a single instance that can handle your reads and writes and experiences downtime during failover or major upgrades.
In contrast, Aurora DSQL is distributed but offers minimal compatibility with PostgreSQL, as discussed in the previous posts of this series.
From a user's perspective, these two databases are quite different. Technically, DSQL storage does not use the multi-AZ storage, internally called "Grover", or simply "Aurora storage".
However, from a managed service perspective, it makes complete sense for AWS to offer DSQL under the umbrella of Amazon Aurora. I'll explain why, but remember that while the other posts in this series are technical and fact-based, this one is subjective and represents an opinion.
All Amazon RDS (the relational database services of AWS) databases are compatible with databases outside of Amazon. Aurora is compatible with MySQL or PostgreSQL. From an application standpoint, there is no difference.
Finding customers for a new database that behaves differently than any existing one would be tough. Amazon DynamoDB and Google Spanner are exceptions to this. Still, they succeeded because they were used internally by the cloud provider: DynamoDB powers Amazon (it all started with the shopping cart), and Google's critical applications rely on Spanner (AdWords, YouTube). Many AWS customers also use DynamoDB because it is unique for specific use cases, but Spanner has never gained significant traction among Google customers. Customers prefer the freedom to migrate their applications to other cloud vendors without re-designing and re-coding them.
Aurora DSQL combines the advantages of DynamoDB (restricting user actions to ensure predictable performance and scalability) and Spanner (integrating some SQL features like ACID transactions and relational data modeling). However, it is not equivalent to any existing database, employing Optimistic Concurrency Control with some SQL features (transactions, tables, and joins) but lacking others (no foreign keys, no long transactions). No customer would place their core business data in a database with no equivalent on-premises or at another cloud vendor. But they can if it is Aurora, previously known as a PostgreSQL alternative.
Cataloging DSQL under the Aurora umbrella resolves the issue. A multi-region resilient and elastic database attracts customers. They can begin building their applications on it and present it as PostgreSQL. Quickly, they encounter one of its many limitations. They can either accept the necessary workarounds in application code to continue using the distributed database, or voice concerns about its lack of PostgreSQL compatibility. At this point, AWS can redirect them to the non-DSQL alternatives: Aurora (APG), Aurora Serverless, or Aurora Limitless.
As a distinct database service, DSQL's low adoption would make it unprofitable. However, as a variation of the Aurora engine, it can still be profitable despite low or short-term adoption, as it can direct users to the non-DSQL Aurora whenever they require PostgreSQL-compatible behavior. Developers can experiment with building new application services on Aurora DSQL and decide whether to remain with DSQL or switch to another version of Aurora later. Managers can consider AWS as an alternative to Google, with a Spanner equivalent. Ultimately, it’s about choosing the cloud infrastructure vendor, regardless of which database engine goes to production. It can also be managed by an AWS partner, like YugabyteDB, which is distributed, functions like PostgreSQL, and is multi-cloud, but the customer who started on AWS can remain on Amazon cloud infrastructure.
Aurora DSQL is technically different from the existing Aurora database engine. Still, from a user point of view, and with good marketing, it can be seen as a switch in the Aurora database service that reduces PostgreSQL compatibility to provide horizontal scalability, like the switch to serverless, when cost is more critical than performance predictability, to limitless, when a sharding key is possible for all use cases, or to compatibility extensions, like Babelfish. Amazon RDS is a marketplace for relational databases, and Amazon Aurora is a dedicated stall offering different flavors of PostgreSQL.
February 07, 2025
Outgrowing Postgres: When to move OLAP workloads off Postgres
February 06, 2025
Hanging in there
I have been reviewing papers for USENIX ATC and handling work stuff at MongoDB Research. I cannot blog about either yet. So, instead of a paper review or technical blog, I share some random observations and my current mood. Bear with me as I vent here. You may disagree with some of my takes. Use the comments to share your thoughts.
Damn, Buffalo. Your winter is brutal and depressing. (Others on r/Buffalo also suffer; many suggest video games, drugs, or drinking.) After 20 years of Buffalo winters, I am fed up with the snow and cold. When I taught the distributed systems course in fall semesters, I would ask the new batch of Indian students how many had never seen snow, and all hands would shoot up. I would warn them that by winter's end they would despise that magic fairy dust. Ugh, sidewalks are already piled high with snow that freezes, muddies, and decays into holes.
Forgive the gloomy start. I had a bad flu ten days ago. Just as I began to recover, another milder bout struck. My joints hurt, and I still feel miserable. OK Murat... Breathe... And keep writing and creating. Only 90 more days until spring. Well, at least Buffalo doesn't have any wildfire risk.
My ADHD flares up again. I missed two weeks at the gym due to the flu. Winter and politics do not help either. I now use my old iPad to read and write longhand to get some focus and calm. I am eyeing the Remarkable Paper Pro. We need more calm technology.
I guess I will have to practice Feynman's active social irresponsibility for some time to keep my sanity, and do better for my company, work, and family. I no longer follow news and politics. I would have a lot more to say about this, but maybe for another time.
Ok, some tech and research trends. I wrote a series about the use of time in distributed systems. Watch this space. RDMA and hardware acceleration remain strong. Formal methods techniques/applications are surging to augment, synergize with, and safeguard AI. Why can't we see more work in the usability area? Especially for higher-level abstractions for programming distributed systems. With AI, can this topic finally see some boost?
DeepSeek is nice, but half the time it is unavailable. This made me go back to ChatGPT, which seems to be improving every week. I also think Gemini is underrated. I like Google's AI overview for search: it keeps getting better and more insightful, and it's blazing fast.
I am quite optimistic about AI. I hope this doesn't come across as bragging, but "asking questions" is my superpower, which I've cultivated throughout my life. So I don't feel threatened by AI. If anything, I feel empowered.
College applications and decisions in the US are broken. I think universities came up with this "holistic review/admission" thing just so they won't get sued for rejecting a much more qualified candidate in favor of a less qualified one. I am not sure they can do a thorough job of evaluating applicants when there are more than 90 thousand applications to go through in less than two months. Why isn't there a matching system like the National Resident Matching Program in place already? Colleges are antiquated. They must improve, especially with AI breathing down their necks for disruption.
I submitted a HYTRADBOI talk. What a funny and clever name. Hyderabad Boi? Jamie does a great job organizing this singlehandedly. He makes it seem easy, which I am sure it is not. I hope Jamie publishes his runbook, so others can copy this model, and a thousand more flowers bloom.
Any new, banger sci-fi books or series? Kindly suggest some fresh reads. I need quality sci-fi to survive the winter. With AI, there should be a new genre of sci-fi books coming up, no?
These smart glasses looked cool, with AI and all. (I keep adding "with AI" to everything. The age of AI --not AGI yet-- truly arrived.) But I changed my mind. I want technology that helps me focus and get calmer, rather than distract me further. I already get enough distractions.
I drafted this post on my iPad. I should schedule more time away from my laptop, thinking and writing on paper.
There... I have shared my complaints and vented. I feel better. Thanks for commiserating.