MongoDB is used for its strength in online transaction processing (OLTP), with a document model that naturally aligns with domain-specific transactions and their access patterns. In addition, MongoDB supports advanced search through its Atlas Search index, based on Apache Lucene. This can be used for near-real-time analytics and, combined with the aggregation pipeline, adds some online analytical processing (OLAP) capabilities. Thanks to the document model, this analytics capability doesn't require a different data structure, which lets MongoDB execute hybrid transactional and analytical processing (HTAP) workloads efficiently, as demonstrated in this article with an example from the healthcare domain.
Traditional relational databases employ a complex query optimization method known as "star transformation" and rely on multiple single-column indexes, along with bitmap operations, for efficient ad-hoc queries. This requires a dimensional schema, or star schema, which differs from the normalized operational schema updated by transactions. In contrast, MongoDB can be queried with a similar strategy using its document schema for operational use cases, simply requiring the addition of an Atlas Search index on the collection that stores transaction facts.
To demonstrate how a single index on a fact collection enables efficient queries even when filters are applied to other dimension collections, I utilize the MedSynora DW dataset, which is similar to a star schema with dimensions and facts. This dataset, published by M. Ebrar Küçük on Kaggle, is a synthetic hospital data warehouse covering patient encounters, treatments, and lab tests, and is compliant with privacy standards for healthcare data science and machine learning.
Import the dataset
The dataset is accessible on Kaggle as a folder of comma-separated values (CSV) files for dimensions and facts compressed into a 730MB zip file. The largest fact table that I'll use holds 10 million records.
I download the CSV files and uncompress them:
curl -L -o medsynora-dw.zip "https://www.kaggle.com/api/v1/datasets/download/mebrar21/medsynora-dw"
unzip medsynora-dw.zip
I import each file into a collection, using mongoimport from the MongoDB Database Tools:
for i in "MedSynora DW"/*.csv
do
mongoimport -d "MedSynoraDW" --file="$i" --type=csv --headerline -c "$(basename "$i" .csv)" -j 8
done
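A quick sanity check after the import (the counts below are what I expect from the dataset description, which mentions about 10 million lab test records):
db.FactEncounter.countDocuments()
db.FactLabTests.countDocuments()   // roughly 10 million documents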
For this demo, I'm interested in two fact tables: FactEncounter and FactLabTest. Here are the fields described in the file headers:
# head -1 "MedSynora DW"/Fact{Encounter,LabTests}.csv
==> MedSynora DW/FactEncounter.csv <==
Encounter_ID,Patient_ID,Disease_ID,ResponsibleDoctorID,InsuranceKey,RoomKey,CheckinDate,CheckoutDate,CheckinDateKey,CheckoutDateKey,Patient_Severity_Score,RadiologyType,RadiologyProcedureCount,EndoscopyType,EndoscopyProcedureCount,CompanionPresent
==> MedSynora DW/FactLabTests.csv <==
Encounter_ID,Patient_ID,Phase,LabType,TestName,TestValue
The fact tables reference the following dimensions:
# head -1 "MedSynora DW"/Dim{Disease,Doctor,Insurance,Patient,Room}.csv
==> MedSynora DW/DimDisease.csv <==
Disease_ID,Admission Diagnosis,Disease Type,Disease Severity,Medical Unit
==> MedSynora DW/DimDoctor.csv <==
Doctor_ID,Doctor Name,Doctor Surname,Doctor Title,Doctor Nationality,Medical Unit,Max Patient Count
==> MedSynora DW/DimInsurance.csv <==
InsuranceKey,Insurance Plan Name,Coverage Limit,Deductible,Excluded Treatments,Partial Coverage Treatments
==> MedSynora DW/DimPatient.csv <==
Patient_ID,First Name,Last Name,Gender,Birth Date,Height,Weight,Marital Status,Nationality,Blood Type
==> MedSynora DW/DimRoom.csv <==
RoomKey,Care_Level,Room Type
Here is the dimensional model, often referred to as a "star schema" because the fact tables sit at the center and reference the dimensions. Because of normalization, a fact that contains a one-to-many composition is split into two CSV files, to fit into two SQL tables:
Star schema with facts and dimensions. The facts are stored in two tables in the CSV files or a SQL database, but in a single collection in MongoDB. It holds the fact measures and the dimension keys, which reference the keys of the dimension collections.
MongoDB allows the storage of one-to-many compositions, such as Encounters and LabTests, within a single collection. By embedding LabTests as an array in Encounter documents, this design pattern promotes data colocation to reduce disk access and increase cache locality, minimizes duplication to improve storage efficiency, maintains data integrity without requiring additional foreign key processing, and enables more indexing possibilities. The document model also circumvents a common issue in SQL analytic queries, where joining prior to aggregation may yield inaccurate results due to the repetition of parent values in a one-to-many relationship.
Since this is the right data model for an operational database holding such data, I use an aggregation pipeline to create a new collection that I'll use instead of the two imported from the normalized CSV files:
db.FactLabTests.createIndex({ Encounter_ID: 1, Patient_ID: 1 });
db.FactEncounter.aggregate([
  {
    $lookup: {
      from: "FactLabTests",
      localField: "Encounter_ID",
      foreignField: "Encounter_ID",
      as: "LabTests"
    }
  },
  {
    $addFields: {
      LabTests: {
        $map: {
          input: "$LabTests",
          as: "test",
          in: {
            Phase: "$$test.Phase",
            LabType: "$$test.LabType",
            TestName: "$$test.TestName",
            TestValue: "$$test.TestValue"
          }
        }
      }
    }
  },
  {
    $out: "FactEncounterLabTests"
  }
]);
Here is how one document looks:
AtlasLocalDev atlas [direct: primary] MedSynoraDW>
db.FactEncounterLabTests.find().limit(1)
[
{
_id: ObjectId('67fc3d2f40d2b3c843949c97'),
Encounter_ID: 2158,
Patient_ID: 'TR479',
Disease_ID: 1632,
ResponsibleDoctorID: 905,
InsuranceKey: 82,
RoomKey: 203,
CheckinDate: '2024-01-23 11:09:00',
CheckoutDate: '2024-03-29 17:00:00',
CheckinDateKey: 20240123,
CheckoutDateKey: 20240329,
Patient_Severity_Score: 63.2,
RadiologyType: 'None',
RadiologyProcedureCount: 0,
EndoscopyType: 'None',
EndoscopyProcedureCount: 0,
CompanionPresent: 'True',
LabTests: [
{
Phase: 'Admission',
LabType: 'CBC',
TestName: 'Lymphocytes_abs (10^3/µl)',
TestValue: 1.34
},
{
Phase: 'Admission',
LabType: 'Chem',
TestName: 'ALT (U/l)',
TestValue: 20.5
},
{
Phase: 'Admission',
LabType: 'Lipids',
TestName: 'Triglycerides (mg/dl)',
TestValue: 129.1
},
{
Phase: 'Discharge',
LabType: 'CBC',
TestName: 'RBC (10^6/µl)',
TestValue: 4.08
},
...
In MongoDB, the document model utilizes embedding and reference design patterns, resembling a star schema with a primary fact collection and references to various dimension collections. It is crucial to ensure that the dimension references are properly indexed before querying these collections.
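For example, the dimension keys referenced by the fact collection can be indexed as follows (a minimal sketch, assuming the field names imported above and that each key is unique within its dimension):
db.DimPatient.createIndex({ Patient_ID: 1 }, { unique: true });
db.DimDoctor.createIndex({ Doctor_ID: 1 }, { unique: true });
db.DimDisease.createIndex({ Disease_ID: 1 }, { unique: true });
db.DimInsurance.createIndex({ InsuranceKey: 1 }, { unique: true });
db.DimRoom.createIndex({ RoomKey: 1 }, { unique: true });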
Atlas Search index
Search indexes differ from regular indexes, which rely on a single composite key: they can index multiple fields without requiring a specific order to establish a key. This makes them ideal for ad-hoc queries, where the filtering dimensions are not known in advance.
I create a single Atlas Search index that encompasses all dimensions or measures that I might use in predicates, including those found in an embedded document:
db.FactEncounterLabTests.createSearchIndex(
  "SearchFactEncounterLabTests", {
    mappings: {
      dynamic: false,
      fields: {
        "Encounter_ID": { "type": "number" },
        "Patient_ID": { "type": "token" },
        "Disease_ID": { "type": "number" },
        "InsuranceKey": { "type": "number" },
        "RoomKey": { "type": "number" },
        "ResponsibleDoctorID": { "type": "number" },
        "CheckinDate": { "type": "token" },
        "CheckoutDate": { "type": "token" },
        "LabTests": {
          "type": "document", fields: {
            "Phase": { "type": "token" },
            "LabType": { "type": "token" },
            "TestName": { "type": "token" },
            "TestValue": { "type": "number" }
          }
        }
      }
    }
  }
);
Since I don't need full-text search on the keys, I map the character string ones as token and the integer keys as number. The keys are generally used for equality predicates, but some can also be used for range predicates when their format permits, such as the check-in and check-out dates, whose YYYY-MM-DD HH:MM:SS format sorts lexicographically in chronological order.
In a relational database, the star schema approach emphasizes limiting the number of columns in the fact tables, as they typically contain numerous rows. The smaller dimension tables can hold more columns and are typically denormalized in SQL databases (favoring a star schema over a snowflake schema). Likewise, in document modeling, embedding all dimension fields would unnecessarily increase the size of the fact documents, so it is preferable to reference the dimension collections instead. The general principles of MongoDB data modeling allow querying it as a star schema without extra preparation, since MongoDB databases are designed around the application's access patterns.
Star query
A star schema allows queries that filter on fields of the dimension collections to be processed in several stages:
- In the first stage, filters are applied to the dimension collections to extract all dimension keys. These keys typically do not require additional indexes, as the dimensions are generally small in size.
- In the second stage, a search is conducted using all previously obtained dimension keys on the fact collection. This process utilizes the search index built on those keys, allowing for quick access to the required documents.
- A third stage may retrieve additional dimensions to gather the necessary fields for aggregation or projection. This multi-stage process ensures that the applied filter reduces the dataset from the large fact collection before any further operations are conducted.
For an example query, I aim to analyze lab test records for female patients who are over 170 cm tall, underwent lipid lab tests, have insurance coverage exceeding 80%, and were treated by Japanese doctors in deluxe rooms for hematological conditions.
Search aggregation pipeline
To apply all filters while reading as little as possible from the fact collection, I begin with a simple aggregation pipeline that starts with a search on the search index, where filters on the fact collection's own fields are applied directly. I build it in a local variable with a compound operator, so that stage one of the star query can append one more filter per dimension.
Before going through the star query stages to add the dimension filters, my query already has a filter on the lab type, which is stored in the fact collection and indexed.
const search = {
  "$search": {
    "index": "SearchFactEncounterLabTests",
    "compound": {
      "must": [
        { "in": { "path": "LabTests.LabType", "value": "Lipids" } },
      ]
    },
    "sort": { CheckoutDate: -1 }
  }
}
I have added a "sort" operation to sort the result by check-out date in descending order. This illustrates the advantage of sorting during the index search rather than in later steps of the aggregation pipeline, particularly when a "limit" is applied.
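For example, a minimal sketch (the limit of 10 is arbitrary) that fetches only the ten most recent check-outs matching this filter:
db.FactEncounterLabTests.aggregate([
  search,            // sorted by CheckoutDate inside the search index
  { "$limit": 10 }   // only ten documents are fetched from the collection
])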
I'll use this local variable to add more filters in Stage 1 of the star query, so that it can be executed for Stage 2, and collect documents for Stage 3.
Stage 1: Query the dimension collections
In the first stage of the star query, I obtain the dimension keys from the dimension collections. For every dimension that carries a filter, I run a find() on the dimension collection to get its keys and append a "must" condition to the "compound" of the fact index search.
The following adds the conditions on the Patient (female patients over 170 cm):
search["$search"]["compound"]["must"].push( { in: {
path: "Patient_ID", // Foreign Key in Fact
value: db.DimPatient.find( // Dimension collection
{Gender: "Female", Height: { "$gt": 170 }} // filter on Dimension
).map(doc => doc["Patient_ID"]).toArray() } // Primary Key in Dimension
})
The following adds the conditions on the Doctor (Japanese):
search["$search"]["compound"]["must"].push( { in: {
path: "ResponsibleDoctorID", // Foreign Key in Fact
value: db.DimDoctor.find( // Dimension collection
{"Doctor Nationality": "Japanese" } // filter on Dimension
).map(doc => doc["Doctor_ID"]).toArray() } // Primary Key in Dimension
})
The following adds the condition on the Room (Deluxe):
search["$search"]["compound"]["must"].push( { in: {
path: "RoomKey", // Foreign Key in Fact
value: db.DimRoom.find( // Dimension collection
{"Room Type": "Deluxe" } // filter on Dimension
).map(doc => doc["RoomKey"]).toArray() } // Primary Key in Dimension
})
The following adds the condition on the Disease (Hematology):
search["$search"]["compound"]["must"].push( { in: {
path: "Disease_ID", // Foreign Key in Fact
value: db.DimDisease.find( // Dimension collection
{"Disease Type": "Hematology" } // filter on Dimension
).map(doc => doc["Disease_ID"]).toArray() } // Primary Key in Dimension
})
Finally, the condition on the Insurance coverage (greater than 80%):
search["$search"]["compound"]["must"].push( { in: {
path: "InsuranceKey", // Foreign Key in Fact
value: db.DimInsurance.find( // Dimension collection
{"Coverage Limit": { "$gt": 0.8 } } // filter on Dimension
).map(doc => doc["InsuranceKey"]).toArray() } // Primary Key in Dimension
})
All these search criteria have the same shape: a find() on the dimension collection, with the filters from the query, resulting in an array of dimension keys (like a primary key in a dimension table) that are used to search in the fact documents using it as a reference (like a foreign key in a fact table).
Each of these steps queried a dimension collection to obtain a simple array of dimension keys, which is added to the aggregation pipeline. Rather than joining tables as in a relational database, the criteria on the dimensions are pushed down to the query on the fact collection.
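Since every filter follows this pattern, it can be wrapped in a small helper. This is a sketch with hypothetical names, not part of the original article:
// Hypothetical helper: resolve a dimension filter to keys and push it into the $search compound.
function addDimensionFilter(search, factPath, dimColl, dimFilter, dimKey) {
  search["$search"]["compound"]["must"].push({
    in: {
      path: factPath,                                   // foreign key in the fact collection
      value: dimColl.find(dimFilter)                    // filter on the dimension collection
                    .map(doc => doc[dimKey]).toArray()  // primary key in the dimension
    }
  });
}
// Example: the same filter on deluxe rooms as above
addDimensionFilter(search, "RoomKey", db.DimRoom, { "Room Type": "Deluxe" }, "RoomKey");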
Stage 2: Query the fact search index
With short queries on the dimensions, I have built the following pipeline search step:
AtlasLocalDev atlas [direct: primary] MedSynoraDW> print(search)
{
'$search': {
index: 'SearchFactEncounterLabTests',
compound: {
must: [
{ in: { path: 'LabTests.LabType', value: 'Lipids' } },
{
in: {
path: 'Patient_ID',
value: [
'TR551', 'TR751', 'TR897', 'TRGT201', 'TRJB261',
'TRQG448', 'TRSQ510', 'TRTP535', 'TRUC548', 'TRVT591',
'TRABU748', 'TRADD783', 'TRAZG358', 'TRBCI438', 'TRBTY896',
'TRBUH905', 'TRBXU996', 'TRCAJ063', 'TRCIM274', 'TRCXU672',
'TRDAB731', 'TRDFZ885', 'TRDGE890', 'TRDJK974', 'TRDKN003',
'TRE004', 'TRMN351', 'TRRY492', 'TRTI528', 'TRAKA962',
'TRANM052', 'TRAOY090', 'TRARY168', 'TRASU190', 'TRBAG384',
'TRBYT021', 'TRBZO042', 'TRCAS072', 'TRCBF085', 'TRCOB419',
'TRDMD045', 'TRDPE124', 'TRDWV323', 'TREUA926', 'TREZX079',
'TR663', 'TR808', 'TR849', 'TRKA286', 'TRLC314',
'TRMG344', 'TRPT435', 'TRVZ597', 'TRXC626', 'TRACT773',
'TRAHG890', 'TRAKW984', 'TRAMX037', 'TRAQR135', 'TRARX167',
'TRARZ169', 'TRASW192', 'TRAZN365', 'TRBDW478', 'TRBFG514',
'TRBOU762', 'TRBSA846', 'TRBXR993', 'TRCRL507', 'TRDKA990',
'TRDKD993', 'TRDTO238', 'TRDSO212', 'TRDXA328', 'TRDYU374',
'TRDZS398', 'TREEB511', 'TREVT971', 'TREWZ003', 'TREXW026',
'TRFVL639', 'TRFWE658', 'TRGIZ991', 'TRGVK314', 'TRGWY354',
'TRHHV637', 'TRHNS790', 'TRIMV443', 'TRIQR543', 'TRISL589',
'TRIWQ698', 'TRIWL693', 'TRJDT883', 'TRJHH975', 'TRJHT987',
'TRJIM006', 'TRFVZ653', 'TRFYQ722', 'TRFZY756', 'TRGNZ121',
... 6184 more items
]
}
},
{
in: {
path: 'ResponsibleDoctorID',
value: [ 830, 844, 862, 921 ]
}
},
{ in: { path: 'RoomKey', value: [ 203 ] } },
{
in: {
path: 'Disease_ID',
value: [
1519, 1506, 1504, 1510,
1515, 1507, 1503, 1502,
1518, 1517, 1508, 1513,
1509, 1512, 1516, 1511,
1505, 1514
]
}
},
{ in: { path: 'InsuranceKey', value: [ 83, 84 ] } }
]
},
    sort: { CheckoutDate: -1 }
  }
}
MongoDB Atlas Search indexes, built on Apache Lucene, efficiently handle complex queries with multiple conditions and manage long arrays of values. In this example, a search operation integrates the compound operator with the "must" clause to apply filters across attributes. This capability simplifies query design after resolving complex filters into lists of dimension keys.
With the "search" operation created above, I can run an aggregation pipeline to get the document I'm interested in:
db.FactEncounterLabTests.aggregate([
search,
])
With my example, nine documents are returned in 50 milliseconds.
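The third stage of the star query can then enrich or aggregate these few documents with further pipeline stages. Here is a minimal sketch (not from the original article, reusing the field names shown earlier) that joins back to the patient dimension and counts the lipid tests per encounter:
db.FactEncounterLabTests.aggregate([
  search,
  // Stage 3: fetch dimension details only for the few matching facts
  { "$lookup": {
      from: "DimPatient",
      localField: "Patient_ID",
      foreignField: "Patient_ID",
      as: "Patient"
  } },
  { "$project": {
      Encounter_ID: 1,
      CheckoutDate: 1,
      "Patient.First Name": 1,
      "Patient.Last Name": 1,
      // count the embedded lab tests of type "Lipids"
      LipidTests: { "$size": { "$filter": {
        input: "$LabTests", as: "t",
        cond: { "$eq": [ "$$t.LabType", "Lipids" ] }
      } } }
  } }
])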
Estimate the count
This approach is ideal for queries with filters on many conditions, where none are very selective alone, but the combination is highly selective. Using queries on dimensions and a search index on facts avoids reading more documents than necessary. However, depending on the operations you will add to the aggregation pipeline, it is a good idea to estimate the number of records returned by the search index to avoid runaway queries.
Typically, an application that allows users to execute multi-criteria queries may define a threshold and return an error or warning when the estimated number of documents exceeds it, prompting the user to add more filters. For such cases, you can run a "$searchMeta" on the index before the "$search" operation, for example to check that the number of documents matched by the filter stays below 10,000.
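Here is a sketch of one way to write such a check, reusing the compound filters built above (the count options and the threshold handling are my assumptions, not the article's original query):
// Estimate how many documents match before running the full $search.
// With "lowerBound", counting stops at the threshold, which keeps the estimate cheap.
const meta = db.FactEncounterLabTests.aggregate([
  { "$searchMeta": {
      "index": "SearchFactEncounterLabTests",
      "compound": search["$search"]["compound"],
      "count": { "type": "lowerBound", "threshold": 10000 }
  } }
]).toArray()[0];
if (Number(meta.count.lowerBound) >= 10000) {
  print("Too many documents match; please add more filters.");
}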
by Franck Pachot
Murat Demirbas
I attended the TLA+ Community Event at Hamilton, Ontario on Sunday. Several talks pushed the boundaries of formal methods in the real world by incorporating testing, conformance, model translation, and performance estimation. The common thread was that TLA+ isn't just for specs anymore. It's being integrated into tooling: fuzzers, trace validators, and compilers. The community is building bridges from models to reality, and it's a good time to be in the loop.
Below is a summary of selected talks, followed by some miscellaneous observations. This is just a teaser; the recordings will be posted soon on the TLA+ webpage.
Model-Based Fuzzing for Distributed Systems — Srinidhi Nagendra
Traditional fuzzing relies on random inputs and coverage-guided mutation, and works well for single-process software. But it fails for distributed systems due to too many concurrent programs, too many interleavings, and no clear notion of global coverage.
Srinidhi's work brings model-based fuzzing to distributed protocols using TLA+ models for coverage feedback. The approach, ModelFuzz, samples test cases from the implementation (e.g., Raft), simulates them on the model, and uses coverage data to guide mutation. Test cases are not sequences of messages, but of scheduling choices and failure events. This avoids over-generating invalid traces (e.g., a non-leader sending an AppendEntries).
The model acts as a coverage oracle. But direct enumeration of model executions is infeasible because of too many traces, too much instrumentation, too much divergence from optimized implementations (e.g., snapshotting in etcd). Instead, ModelFuzz extracts traces with lightweight instrumentation as mentioned above, simulates them on the model, and mutates schedules in simple ways: swapping events, crashes, and message deliveries. This turns out to be surprisingly effective. They found 1 new bug in etcd, 2 known and 12 new bugs in RedisRaft. They also showed faster bug-finding compared to prior techniques.
TraceLink: Automating Trace Validation with PGo — Finn Hackett & Ivan Beschastnikh
Validating implementation traces against TLA+ specs is still hard. Distributed systems don't hand you a total order. Logs are huge. Instrumentation is brittle. This talk introduced TraceLink, a toolchain that builds on PGo (a compiler from PlusCal to Go) to automate trace validation.
There are three key ideas. First, compress traces by grouping symbolically and using the binary format. Second, track causality using vector clocks, and either explore all possible event orderings (breadth-first) or just one (depth-first). Third, generate diverse traces via controlled randomness (e.g., injecting exponential backoffs between high-level PlusCal steps).
TraceLink is currently tied to PGo-compiled code, but they plan to support plain TLA+ models. Markus asked: instead of instrumenting with vector clocks, why not just log with a high-resolution global clock? That might work too.
Finn is a MongoDB PhD fellow, and will be doing his summer internship with us at MongoDB Research in the Distributed Systems Research Group.
Translating C to PlusCal for Model Checking — Asterios Tech
Asterios Tech (a Safran defense group subsidiary) works on embedded systems with tiny kernels and tight constraints. They need to verify their scheduler, and manual testing doesn't cut it. So the idea they explore is to translate a simplified C subset to PlusCal automatically, then model check the algorithms for safety in the face of concurrency.
The translator, c2pluscal, is built as a Frama-C plugin. Due to the nature of the embedded programming domain, the C code is limited: no libc, no malloc, no dynamic structures. This simplicity helps in the translation process. Pointers are modeled using a TLA+ record with fields for memory location, frame pointer, and offset. Load/store/macros are mapped to PlusCal constructs. Arrays become sequences. Structs become records. Loops and pointer arithmetic are handled conservatively.
I am impressed that they model pointer arithmetic. This is a promising approach for analyzing legacy embedded C code formally, without rewriting it by hand.
More talks
The "TLA+ for All: Notebook-Based Teaching" talk introduced Jupyter-style TLA+ notebooks for pedagogy supporting inline explanations, executable specs, and immediate feedback.
I presented the talk "TLA+ Modeling of MongoDB Transactions" (joint work with Will Schultz). We will post a writeup soon.
Jesse J. Davis presented "Are We Serious About Using TLA+ For Statistical Properties?". He plans to blog about it.
Andrew Helwer presented "It’s never been easier to write TLA⁺ tooling!", and I defer to his upcoming blog post as well.
Markus Kuppe, who did most of the work of organizing the event, demonstrated that GitHub Copilot can solve the Diehard problem with TLA+ in 4 minutes of reasoning, with some human intervention. He said that the TLA+ Foundation and NVidia are funding the "TLAI" challenge, for exploring novel AI augmentation of TLA+ modeling.
Miscellaneous
The 90-minute lunch breaks were very European. A U.S. conference would cap it at an hour, and DARPA or NSF would eliminate it entirely: brown-bag lunch through the talks. The long break was great for conversation.
In our workshop, audience questions were frequent and sharp. We are a curious bunch.
The venue was McMaster University in Hamilton, a 90-minute drive from home. Border crossings at the Lewiston-Queenston bridge were smooth, without delays. But questions from border officers still stressed my daughters (ages 9 and 13). I reminded them how much worse we had it when we traveled on visas instead of US citizenship.
My daughters also noticed how everything (roads, buildings, parks) is called Queen's this and Queen's that. My 9-year-old tried to argue that since Canada is so close to the US and looks so similar to it, it feels more like a U.S. state than a separate country. Strong Trump vibes with that one.
The USD to CAD exchange rate is 1.38. I still remembered the two currencies being pretty much at par, so I was surprised. We hadn't visited Canada since 2020. A Canadian friend told me there's widespread discontent about the economy, rent, and housing prices.
Canadians are reputed to be very nice. But the drivers were aggressive: cutting in, speeding in mall parking lots. I also had tense, passive-aggressive encounters with two cashiers and a McMaster staff member. Eh.
by Murat (noreply@blogger.com)
Percona Database Performance Blog
Recognizing a gap in the availability of straightforward tools for MongoDB benchmarking, particularly those that do not require complex compilation, configuration, and intricate setups, I developed MongoDB Workload Generator. The aim was to provide MongoDB users with an accessible solution for effortlessly generating realistic data and simulating application workloads on both sharded and non-sharded clusters. […]
by Daniel Almeida
Tinybird Engineering Blog
We just launched a conversational AI feature. Here's how we built the feature that lets you chat with your data.
by Rafa Moreno
Tinybird Engineering Blog
Introducing Explorations, a conversational UI to chat with your real-time data in Tinybird.
by Javi Santana