a curated list of database news from authoritative sources

January 05, 2026

3 Steps to Optimize Your Queries for Speed

When performance matters, how you store your application objects, write queries, and design indexes can be the difference between scanning a collection for minutes and getting results in milliseconds—without touching a single document.

In this article, we’ll walk through an example in MongoDB. We’ll start with a normalized model, move to an embedded model, add an index to avoid a full collection scan, and then refine both the index and query to read only what’s needed for the result. The execution plan and execution statistics will guide us at each step.

Scenario

Our domain is an e‑commerce order‑entry application with customers, products, orders, and order lines. We want to:

Find the most recent order in Switzerland containing the book “MongoDB Data Modeling and Schema Design”.

Initial model: normalized with references

If we build the data model without considering how applications will access it—or simply migrate from a relational database by creating one collection per table—we end up with a model where a single domain aggregate is split across multiple collections.

In this model, we insert a customer, some products (books), an order for this customer, and the corresponding order lines for those books:

// Customer  
db.customers.insertOne({  
  "_id": "C0001",  
  "name": "Franck Pachot",  
  "email": "franck.pachot@example.com",  
  "address": { "city": "Geneva", "country": "Switzerland" }  
})  

// Products  
db.products.insertOne({  
  "_id": "P0001",  
  "title": "MongoDB in Action, Third Edition: Building on the Atlas Data Platform",  
  "author": "Arek Borucki",  
  "price": 59.99, "category": "Database", "format": "Paperback"  
})  
db.products.insertOne({  
  "_id": "P0002",  
  "title": "MongoDB Data Modeling and Schema Design",  
  "author": ["Daniel Coupal", "Pascal Desmarets", "Steve Hoberman"],  
  "price": 54.99, "category": "Database", "format": "Paperback"  
})  
db.products.insertOne({  
  "_id": "P0003",  
  "title": "High Performance with MongoDB: Best Practices for Performance Tuning, Scaling, and Architecture",  
  "author": ["Asya Kamsky", "Ger Hartnett", "Alex Bevilacqua"],  
  "price": 49.99, "category": "Database", "format": "Paperback"  
})  

// Order  
db.orders.insertOne({  
  "_id": "O0001",  
  "customerId": "C0001",  
  "orderDate": ISODate("2024-10-22T09:00:00Z"),  
  "status": "Processing"  
})  

// Order lines  
db.orderlines.insertMany([  
  { "orderId": "O0001", "productId": "P0001", "quantity": 1 },  
  { "orderId": "O0001", "productId": "P0002", "quantity": 1 },  
  { "orderId": "O0001", "productId": "P0003", "quantity": 1 }  
])

This is a classic relational database pattern, but querying and optimizing it is complex because joins must occur before filtering.

In MongoDB, to rebuild the domain object aggregate, we use an aggregation pipeline with $lookup and $match:

db.orders.aggregate([  
  // Get customer info to filter by country  
  { $lookup: {  
      from: "customers",  
      localField: "customerId",  
      foreignField: "_id",  
      as: "customer"  
    }},  
  { $unwind: "$customer" },  

  // Get product info to filter by title  
  { $lookup: {  
      from: "orderlines",  
      let: { orderIdVar: "$_id" },  
      pipeline: [  
        { $match: { $expr: { $eq: ["$orderId", "$$orderIdVar"] } } },  
        { $lookup: {  
            from: "products",  
            localField: "productId",  
            foreignField: "_id",  
            as: "product"  
          }},  
        { $unwind: "$product" }  
      ],  
      as: "orderLines"  
    }},  

  // Apply filters  
  { $match: {  
      "customer.address.country": "Switzerland",  
      "orderLines.product.title": "MongoDB Data Modeling and Schema Design"  
    }},  

  // Sort and limit  
  { $sort: { orderDate: -1 } },  
  { $limit: 1 }  
])  

This query works, but it scans the entire collection. Indexes aren’t effective here because $match runs only after the $lookup and $unwind stages.

With one million orders, it would read all one million, perform three million lookups, filter on country and book title, sort a large intermediate result, and then return just one document.

That’s why joins can be slow: when data is split across many tables or collections, you lose the benefit of efficient index access—compound indexes can only target one collection. It also requires coding every query by hand in the domain repository.

We need to rethink the data model.

Step 1: Embedding to fit the domain model

Knowing the database is for an order‑entry system, we can design a document model that matches the domain. An order should be a self‑contained document with all customer info and complete product details for each order line—because that’s how it’s displayed, printed, or emailed. If it’s a single logical document on paper, it should be a single document in MongoDB.

Here’s the embedded form:

db.orders.insertOne({  
  _id: 'O0002',  
  customerId: 'C0001',  
  orderDate: ISODate('2024-10-22T09:00:00.000Z'),  
  status: 'Processing',  
  customer: {  
       _id: 'C0001',  
       name: 'Franck Pachot',  
       email: 'franck.pachot@example.com',  
       address: { city: 'Geneva', country: 'Switzerland' }  
  },  
  orderLines: [  
       { quantity: 1, product: { _id: 'P0001', title: 'MongoDB in Action...', price: 59.99 } },  
       { quantity: 1, product: { _id: 'P0002', title: 'MongoDB Data Modeling and Schema Design', price: 54.99 } },  
       { quantity: 1, product: { _id: 'P0003', title: 'High Performance with MongoDB...', price: 49.99 } }  
  ]  
})

Duplicated data isn’t a concern here—documents are compressed on disk, and changes to product descriptions don’t affect completed orders.

Querying becomes far simpler:

db.orders.find(  
  {  
    "customer.address.country": "Switzerland",  
    "orderLines.product.title": "MongoDB Data Modeling and Schema Design"  
  }  
).sort({ orderDate: -1 }).limit(1)

No lookups needed. But it still does a collection scan, filters, sorts, and finally returns one document.

I add one million documents and run the query with execution statistics:

db.orders.insertMany(Array.from({length: 333333},()=>({customer:{address:{country:"Switzerland"}}})))
db.orders.insertMany(Array.from({length: 666666},()=>({customer:{address:{country:"Germany"}}})))

db.orders.find(  
  {  
    "customer.address.country": "Switzerland",  
    "orderLines.product.title": "MongoDB Data Modeling and Schema Design"  
  }  
).sort({ orderDate: -1 }).limit(1).explain("executionStats").executionStats

{
  executionSuccess: true,
  nReturned: 1,
  executionTimeMillis: 559,
  totalKeysExamined: 0,
  totalDocsExamined: 1000001,
  executionStages: {
    isCached: false,
    stage: 'SORT',
    nReturned: 1,
    executionTimeMillisEstimate: 542,
    works: 1000004,
    advanced: 1,
    needTime: 1000002,
    needYield: 0,
    saveState: 29,
    restoreState: 29,
    isEOF: 1,
    sortPattern: { orderDate: -1 },
    memLimit: 104857600,
    limitAmount: 1,
    type: 'simple',
    totalDataSizeSorted: 0,
    usedDisk: false,
    spills: 0,
    spilledRecords: 0,
    spilledBytes: 0,
    spilledDataStorageSize: 0,
    inputStage: {
      stage: 'COLLSCAN',
      filter: {
        '$and': [
          { 'customer.address.country': { '$eq': 'Switzerland' } },
          {
            'orderLines.product.title': { '$eq': 'MongoDB Data Modeling and Schema Design' }
          }
        ]
      },
      nReturned: 1,
      executionTimeMillisEstimate: 516,
      works: 1000002,
      advanced: 1,
      needTime: 1000000,
      needYield: 0,
      saveState: 29,
      restoreState: 29,
      isEOF: 1,
      direction: 'forward',
      docsExamined: 1000001
    }
  }
}

To return a single document (nReturned: 1), one million documents were read (totalDocsExamined: 1000001) and sorted (stage: 'SORT', works: 1000004).

Now that we have everything in one collection, we can optimize with an index. Our objective is to reduce totalDocsExamined to match nReturned.

Step 2: Indexing for filtering, sorting, and pagination

Our query filters by:

  • Equality on customer country
  • Equality on product title
  • A range on order date (due to sort().limit()).

An ideal compound index:

db.orders.createIndex({  
  "customer.address.country": 1,   // equality  
  "orderLines.product.title": 1,   // equality  
  "orderDate": 1                   // sort/range  
})  

Here is the execution plan:

db.orders.find(  
  {  
    "customer.address.country": "Switzerland",  
    "orderLines.product.title": "MongoDB Data Modeling and Schema Design"  
  }  
).sort({ orderDate: -1 }).limit(1).explain("executionStats").executionStats

{
  executionSuccess: true,
  nReturned: 1,
  executionTimeMillis: 0,
  totalKeysExamined: 1,
  totalDocsExamined: 1,
  executionStages: {
    isCached: false,
    stage: 'LIMIT',
    nReturned: 1,
    executionTimeMillisEstimate: 0,
    works: 2,
    advanced: 1,
    needTime: 0,
    needYield: 0,
    saveState: 0,
    restoreState: 0,
    isEOF: 1,
    limitAmount: 1,
    inputStage: {
      stage: 'FETCH',
      nReturned: 1,
      executionTimeMillisEstimate: 0,
      works: 1,
      advanced: 1,
      needTime: 0,
      needYield: 0,
      saveState: 0,
      restoreState: 0,
      isEOF: 0,
      docsExamined: 1,
      alreadyHasObj: 0,
      inputStage: {
        stage: 'IXSCAN',
        nReturned: 1,
        executionTimeMillisEstimate: 0,
        works: 1,
        advanced: 1,
        needTime: 0,
        needYield: 0,
        saveState: 0,
        restoreState: 0,
        isEOF: 0,
        keyPattern: {
          'customer.address.country': 1,
          'orderLines.product.title': 1,
          orderDate: 1
        },
        indexName: 'customer.address.country_1_orderLines.product.title_1_orderDate_1',
        isMultiKey: true,
        multiKeyPaths: {
          'customer.address.country': [],
          'orderLines.product.title': [ 'orderLines' ],
          orderDate: []
        },
        isUnique: false,
        isSparse: false,
        isPartial: false,
        indexVersion: 2,
        direction: 'backward',
        indexBounds: {
          'customer.address.country': [ '["Switzerland", "Switzerland"]' ],
          'orderLines.product.title': [
            '["MongoDB Data Modeling and Schema Design", "MongoDB Data Modeling and Schema Design"]'
          ],
          orderDate: [ '[MaxKey, MinKey]' ]
        },
        keysExamined: 1,
        seeks: 1,
        dupsTested: 1,
        dupsDropped: 0
      }
    }
  }
}

With this index, the query jumps straight to the target document:

  • totalKeysExamined: 1 from one range (seeks: 1 in stage: 'IXSCAN')
  • totalDocsExamined: 1 from stage: 'FETCH' without the need to sort
  • nReturned: 1

The result is retrieved in sorted order directly from the index, minimizing the number of documents to read. We can do better: our next goal is to eliminate totalDocsExamined entirely.

Step 3: Projection and covering index

If the application only needs the customer’s name and country, we can project:

db.orders.find(  
  {  
    "customer.address.country": "Switzerland",  
    "orderLines.product.title": "MongoDB Data Modeling and Schema Design"  
  },  
  { "customer.name": 1, "customer.address.country": 1, _id: 0 }  
)  

[
  {
    customer: { name: 'Franck Pachot', address: { country: 'Switzerland' } }
  }
]

Now, we can create a covering index:

db.orders.createIndex({  
  "customer.address.country": 1,  
  "orderLines.product.title": 1,  
  "orderDate": 1,  
  "customer.name": 1  
})  

Here is the execution plan:

db.orders.find(  
  {  
    "customer.address.country": "Switzerland",  
    "orderLines.product.title": "MongoDB Data Modeling and Schema Design"  
  },  
  { "customer.name": 1, "customer.address.country": 1, _id: 0 }  
).limit(1).explain("executionStats").executionStats

{
  executionSuccess: true,
  nReturned: 1,
  executionTimeMillis: 0,
  totalKeysExamined: 1,
  totalDocsExamined: 0,
  executionStages: {
    isCached: false,
    stage: 'LIMIT',
    nReturned: 1,
    executionTimeMillisEstimate: 0,
    works: 2,
    advanced: 1,
    needTime: 0,
    needYield: 0,
    saveState: 0,
    restoreState: 0,
    isEOF: 1,
    limitAmount: 1,
    inputStage: {
      stage: 'PROJECTION_DEFAULT',
      nReturned: 1,
      executionTimeMillisEstimate: 0,
      works: 1,
      advanced: 1,
      needTime: 0,
      needYield: 0,
      saveState: 0,
      restoreState: 0,
      isEOF: 0,
      transformBy: { 'customer.name': 1, 'customer.address.country': 1, _id: 0 },
      inputStage: {
        stage: 'IXSCAN',
        nReturned: 1,
        executionTimeMillisEstimate: 0,
        works: 1,
        advanced: 1,
        needTime: 0,
        needYield: 0,
        saveState: 0,
        restoreState: 0,
        isEOF: 0,
        keyPattern: {
          'customer.address.country': 1,
          'orderLines.product.title': 1,
          orderDate: 1,
          'customer.name': 1
        },
        indexName: 'customer.address.country_1_orderLines.product.title_1_orderDate_1_customer.name_1',
        isMultiKey: true,
        multiKeyPaths: {
          'customer.address.country': [],
          'orderLines.product.title': [ 'orderLines' ],
          orderDate: [],
          'customer.name': []
        },
        isUnique: false,
        isSparse: false,
        isPartial: false,
        indexVersion: 2,
        direction: 'forward',
        indexBounds: {
          'customer.address.country': [ '["Switzerland", "Switzerland"]' ],
          'orderLines.product.title': [
            '["MongoDB Data Modeling and Schema Design", "MongoDB Data Modeling and Schema Design"]'
          ],
          orderDate: [ '[MinKey, MaxKey]' ],
          'customer.name': [ '[MinKey, MaxKey]' ]
        },
        keysExamined: 1,
        seeks: 1,
        dupsTested: 1,
        dupsDropped: 0
      }
    }
  }
}

This is the optimal plan: all documents returned (nReturned: 1) come from the index entries that were scanned (totalKeysExamined: 1), without any FETCH (totalDocsExamined: 0).

Conclusion

MongoDB query performance isn’t just about adding indexes—it’s about aligning your data model and queries with how your application really uses data. Turning a four-collection join with millions of lookups into a single index scan that runs in milliseconds comes down to three rules:

  1. Model for Access: Embed data that’s accessed together. If it belongs on one piece of paper—like an order with its customer and product details—it belongs in one document.
  2. Index for Your Query: Use compound indexes that follow your filter, sort, and projection. Equality fields first, range fields last. This turns full scans into targeted index scans, cutting totalDocsExamined from millions to one.
  3. Project Only What You Need: Return only required fields. When a query is fully covered by an index, MongoDB can achieve totalDocsExamined: 0 and return the expected result fast.

You don’t need to apply this to all queries—use these steps only when necessary. Excessive lookups and limited indexing options drove us to optimize the data model. After that, the remaining need for selective filters and the absence of a suitable index led to creating ... (truncated)

Are Database System Researchers Making Correct Assumptions about Transaction Workloads?

On this blog, we have reviewed quite a number of deterministic database papers, including Calvin, SLOG, and Detock, which aim to achieve higher throughput and lower latency. The downside of these systems is that they sacrifice transaction expressivity. They rely on two critical assumptions: first, that transactions are "non-interactive", meaning they are sent as a single request (one-shot) rather than engaging in a multi-round-trip conversation with the application, and second, that the database can know a transaction's read/write set before execution begins (to lock data deterministically).

So when these deterministic database researchers write a paper to validate how these assumptions hold in the real world, we should be skeptical and cautious in our reading. Don't get me wrong: this is a great and valuable paper. But we still need to read it critically.


Summary

The study employed a semi-automated annotation tool to analyze 111 popular open-source web applications and over 30,000 transactions. The authors target applications built with Django (Python) and TypeORM (TypeScript). These Object-Relational Mappers (ORMs) abstract away the SQL layer, allowing developers to interact with a wide variety of underlying relational databases (such as PostgreSQL, MySQL, and SQLite) using standard application code. The authors categorized transactions based on two primary dimensions: Transaction Interactivity and Read/Write Set Inferability.


Transaction Interactivity

The study found that interactive transactions, where the application holds a transaction open while performing logic or reads, are surprisingly rare. 39% of applications contain no interactive transactions. For the remaining 61% of applications, only roughly 9.6% of the workload is interactive.


The authors then distinguish between "Strictly Interactive" transactions, which inherently require external input or side effects mid-flight (e.g., waiting for an API response or user prompt), and "Non-Strictly Interactive" ones, where the back-and-forth is merely deterministic application logic processing reads between queries without outside dependencies. They find strictly interactive transactions account for only 0.5% of the total workload. They argue that the vast majority of remaining interactive transactions are convertible to one-shot transactions with minimal code changes.


Read/Write Set Inferability

The authors tested whether a database could statically predict which rows a transaction would touch. They found that for 90% of applications, at least 58% of transactions have read/write sets that are fully inferable in advance.

The primary obstacle to perfect knowledge is the CDA Mismatch (Conflict Detection Attribute Mismatch), which occurs in roughly 27% of transactions. This happens when a transaction queries data using a secondary attribute (e.g., 'email') rather than the primary locking key (e.g., 'ID'). See Figure 1 above. When this happens, the database cannot infer the lock without first inspecting the data. However, the study remains upbeat about this. Since CDA mismatches typically occur in simple single-statement transactions, they say that the "fallback" mechanism (running a lightweight reconnaissance query to find the ID) is inexpensive.
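The fallback mechanism can be sketched concretely. Below is a hypothetical illustration of the reconnaissance/scout query pattern (my own, not taken from the paper) using Python's built-in sqlite3: the application only knows a secondary attribute (`email`), so a cheap reconnaissance read first resolves the primary key, and the actual transaction is then a one-shot batch whose write set is known before execution begins.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT UNIQUE, credits INTEGER)")
conn.execute("INSERT INTO users VALUES (1, 'ada@example.com', 10)")
conn.commit()

def grant_credits(conn, email, amount):
    # CDA mismatch: the application filters on 'email', but the locking key is 'id'.
    # Step 1: reconnaissance query, outside the transaction, to resolve the key.
    row = conn.execute("SELECT id FROM users WHERE email = ?", (email,)).fetchone()
    if row is None:
        return None
    user_id = row[0]
    # Step 2: one-shot transaction keyed on the primary key; its write set
    # (the single users row user_id) is now known before execution begins.
    with conn:  # commits on success, rolls back on error
        conn.execute("UPDATE users SET credits = credits + ? WHERE id = ?",
                     (amount, user_id))
    return user_id

grant_credits(conn, "ada@example.com", 5)
print(conn.execute("SELECT credits FROM users WHERE id = 1").fetchone()[0])  # 15
```

In a real deterministic system the reconnaissance read can go stale between the two steps, so the one-shot transaction would revalidate its assumptions and retry on a mismatch; the sketch only shows the shape of the pattern.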



Critical Review

This paper provides very valuable empirical data on transaction workloads by surveying a large number of applications. It echoes the scale of the SIGMOD 2015 "Feral Concurrency Control" study, which examined the integrity risks of the mismatch between databases and modern ORM frameworks like Rails. However, we must examine this study critically to identify potential limitations in scope and bias. My two big criticisms are the following.

Selection Bias. The choice of the corpus, open-source web applications using ORMs, heavily skews the results and excludes almost all enterprise systems. The study also ignores ad-hoc transactions executed by human operators (DBAs, analysts) via terminals. These transactions are inherently interactive ("user think time") and often performance-critical.

Convertibility Optimism. The paper argues that "Non-Strictly Interactive" transactions are easily convertible to deterministic models without resorting to brittle stored procedures. They advocate for a reconnaissance/scout query pattern, where the application performs a dry-run read phase to optimistically build a write set before sending it as a single atomic batch. While this is promising, the claim of minimal code changes seems optimistic. Still, I think this conversion challenge is a great opportunity for research and for collecting empirical evidence of the actual engineering costs.

JavaScript Stored Routines in Percona Server for MySQL: A New Era for Database Programmability

For decades, we’ve accepted a painful compromise: if you wanted logic inside the database, you had to write SQL/PSM (Persistent Stored Modules). It’s clunky, hard to debug, and declarative by nature, making it terrible for algorithmic tasks. That ends with Percona Server 8.4.7-7. We are introducing JS Stored Programs as a Tech Preview. Unlike Oracle’s […]

January 04, 2026

Too Close to Our Own Image?

Recent work suggests we may be projecting ourselves onto LLMs more than we admit.

A paper in Nature reports that GPT-4 exhibits "state anxiety". When exposed to traumatic narratives (such as descriptions of accidents or violence), the model's responses score much higher on a standard psychological anxiety inventory. The jump is large, from "low anxiety" to levels comparable to highly anxious humans. The same study finds that therapy works: mindfulness-style relaxation prompts reduce these scores by about a third, though not back to baseline. The authors argue that managing an LLM's emotional state may be important for safe deployment, especially in mental health settings and perhaps in other mission-critical domains.

Another recent paper argues that LLMs can develop a form of brain rot. Continual training on what the authors call junk data (short, viral, sensationalist content typical of social media) leads to models developing weaker reasoning, poorer long-context handling, and worse safety compliance. Those models also score higher on dark personality traits like narcissism and psychopathy, as measured by the TRAITS benchmark. The primary mechanism behind this decline was identified as thought-skipping, where the model truncates its own reasoning chains, echoing the fragmented attention patterns associated with internet addiction in humans.

In my previous post, I discussed how agentic system techniques parallel self-improvement advice. I noted that successful agents rely on habits that look suspiciously like human self-help techniques. These models write things down, using scratchpads and memory buffers to offload working memory. They think in loops, iterating through write-reason-repeat cycles to decompose hard problems. They role-play, adopting constrained personas like "The Critic" or "The Engineer" to constrain the search and behavior space, much like the alter ego effect in humans.

This makes me wonder: will we need psychologists and life coaches for AI in the future? Could this become a vocation or career?

If models can suffer from anxiety that responds to mindfulness prompts, and from cognitive decline that responds to changes in their data diet, we are already treating them as if they were biological minds rather than machines. Perhaps the "Prompt Engineer" of today is just the precursor to the "AI Psychologist" of tomorrow, a professional whose job is to keep the model sane, focused, and "healthy".

findOneAndUpdate(): UPDATE ... RETURNING in MongoDB, Idempotent and ACID, with

You can implement atomic read-write operations in a single call instead of an explicit transaction, reducing both round trips and the window for concurrent conflicts. In PostgreSQL, use UPDATE ... RETURNING instead of SELECT FOR UPDATE followed by UPDATE. In MongoDB, use findOneAndUpdate() instead of updateOne() followed by find(). This enables a single ACID read-write operation that is failure-resilient and safely retryable in MongoDB because it is idempotent. Here is an example.

Demo: two withdrawals and a receipt

Bob withdraws using a debit card (no negative balance allowed). The application first does updateOne to subtract the amount only if the balance is enough, then does a separate find() to print the receipt with the balance. Between those two calls, Alice uses a credit card (allowing a negative balance) and withdraws money from the same account, so Bob's printed balance becomes inconsistent with the balance checked for his withdrawal.

To fix this, we'll use findOneAndUpdate() to return the post-update balance atomically with the withdrawal.

Connection and collection

Here’s a Python + PyMongo program simulating this race condition and showing the fix:

from pymongo import MongoClient, ReturnDocument
import threading
import time

# Connect to MongoDB
client = MongoClient("mongodb://127.0.0.1:27017/?directConnection=true")
db = client.bank
accounts = db.accounts

# Prepare test account: Bob & Alice share this account
def reset_account():
    accounts.drop()
    accounts.insert_one({
        "_id": "b0b4l3ce",
        "owner": ["Bob", "Alice"],
        "balance": 100
    })

reset_account()

I have only one document for this example:

bank> db.accounts.find()

[ { _id: 'b0b4l3ce', owner: [ 'Bob', 'Alice' ], balance: 100 } ]

The balance is set at 100.

Scenario with updateOne() followed by find()

Bob's withdrawal process:


def bob_withdraw_debit(amount): 

    print("[Bob] Attempting debit withdrawal", amount)  
    # Application logic checks balance then updates  
    result = accounts.update_one(  
        {"_id": "b0b4l3ce", "balance": {"$gte": amount}},   # must have enough money  
        {"$inc": {"balance": -amount}}  
    )  

    # If no document was updated, the filter didn't match (insufficient funds)
    if result.modified_count == 0:  
        print("[Bob] Withdrawal denied - insufficient funds")  
        return  

    # Simulate some processing delay before printing the ticket - to be sure to reproduce the race condition
    time.sleep(1)

    # query the balance for the receipt
    balance = accounts.find_one({"_id": "b0b4l3ce"})["balance"]  
    print(f"[Bob] Debit card ticket: withdrew {amount}, balance after withdrawal: {balance}")  

Alice's withdrawal process:

# Alice withdraws using credit card (negative balance allowed)  
def alice_withdraw_credit(amount, delay=0):  
    time.sleep(delay)  # let Bob start first  
    print("[Alice] Attempting credit withdrawal", amount) 
    # There's no balance check for credit cards
    accounts.update_one(  
        {"_id": "b0b4l3ce"},  
        {"$inc": {"balance": -amount}}  
    )  
    print("[Alice] Credit withdrawal complete")  

Demo script to run in two threads and get the race condition:

def demo():
    reset_account()
    t_bob = threading.Thread(target=bob_withdraw_debit, args=(80,))
    t_alice = threading.Thread(target=alice_withdraw_credit, args=(30, 0.5))  # starts just after Bob update
    t_bob.start()
    t_alice.start()
    t_bob.join()
    t_alice.join()

The output is:

>>> demo()

[Bob] Attempting debit withdrawal 80
[Alice] Attempting credit withdrawal 30
[Alice] Credit withdrawal complete
[Bob] Debit card ticket: withdrew 80, balance after withdrawal: -10

Bob received a ticket showing a withdrawal with a negative balance from his debit card — this is a bug ❌

We must read the balance atomically with the update: either the before-image (and subtract the amount ourselves) or the after-image.

Scenario with findOneAndUpdate() and returnDocument: "after"

Bob's withdrawal process:

def bob_withdraw_debit(amount):  
    print("[Bob] Attempting debit withdrawal", amount)  
    doc = accounts.find_one_and_update(  
        {"_id": "b0b4l3ce", "balance": {"$gte": amount}},  
        {"$inc": {"balance": -amount}},  
        return_document=ReturnDocument.AFTER  # get post-update document atomically  
    )  
    # No need to check the update count, we have the document if it was updated
    if not doc:  
        print("[Bob] Withdrawal denied - insufficient funds")  
        return  
    # Ticket immediately shows consistent balance  
    print(f"[Bob] Ticket: withdrew {amount}, balance after withdrawal: {doc['balance']}")

Running the demo again:

>>> demo()

[Bob] Attempting debit withdrawal 80
[Bob] Ticket: withdrew 80, balance after withdrawal: 20
[Alice] Attempting credit withdrawal 30
[Alice] Credit withdrawal complete

Bob received a ticket showing the balance at the exact time of withdrawal ✅

The update write and post-update read occurred as a single atomic operation on the document, leaving no opportunity for another write between the update and the displayed read result.

After-image is saved for consistency and resilience

In MongoDB, reads and writes do not acquire transaction locks like they can in relational databases, but document updates are still atomic at the document level, even without starting an explicit transaction. MongoDB uses lightweight document-level locks internally to ensure ACID guarantees for a single document, since even one update can involve multiple internal reads and writes, such as enforcing unique constraints and updating indexes.

updateOne() returns only metadata (such as the number of documents updated), whereas findOneAndUpdate() returns the updated document itself, with the read and write occurring in the same atomic operation at the single-document level. This atomicity is preserved even in the event of failures.

If the network is lost or the primary crashes and a secondary is promoted, the MongoDB driver retries the operation as part of retryable writes. Because retries must be idempotent, find_one_and_update() returns the same document image on retry.

To support this, MongoDB stores a document image—either the after image (as in this example, using returnDocument: "after") or the before image—in an internal system collection (config.image_collection) that is replicated independently of the oplog, as part of the same transaction:

bank> use config

config> db.image_collection.find()
[
  {
    _id: {
      id: UUID('d04e10d6-c61d-42ad-9a44-5bb226a898d8'),
      uid: Binary.createFromBase64('47DEQpj8HBSa+/TImW+5JCeuQeRkm5NMpJWZG3hSuFU=', 0)
    },
    txnNum: Long('15'),
    ts: Timestamp({ t: 1767478784, i: 5 }),
    imageKind: 'postImage',
    image: { _id: 'b0b4l3ce', owner: [ 'Bob', 'Alice' ], balance: 20 },
    invalidated: false
  }
]
config>

When you enable write retries, this image collection is used internally to make write operations resilient to failures, which are handled transparently by the database driver and server, not by the application, and with the strongest consistency.

Comparison with PostgreSQL

In PostgreSQL, you would use the following queries to do the equivalent:

CREATE TABLE accounts (  
    id TEXT PRIMARY KEY,  
    balance NUMERIC,  
    owners TEXT[]  
);  
INSERT INTO accounts VALUES ('b0b4l3ce', 100, ARRAY['Bob','Alice']);  

-- Alice transaction  
UPDATE accounts    
SET balance = balance - 30    
WHERE id = 'b0b4l3ce';  

-- Bob transaction  
UPDATE accounts    
SET balance = balance - 80    
WHERE id = 'b0b4l3ce' AND balance >= 80    
RETURNING balance AS new_balance;

PostgreSQL drivers do not automatically retry failures and rely on MVCC and locks to ensure ACID properties.

With Repeatable Read isolation—appropriate because the write (SET balance = balance - 80) depends on the read (WHERE balance >= 80)—Bob's transaction takes a snapshot at the start and, when run concurrently with Alice's, still sees balance = 100. If Alice commits first and reduces the balance to 70, Bob's transaction fails with: ERROR: could not serialize access due to concurrent update. The application must handle this by retrying the entire transaction. The driver does not do this automatically.

With the default Read Committed isolation level, Bob’s transaction waits if Alice’s update has locked the row. After Alice commits, PostgreSQL reevaluates Bob’s WHERE clause. The balance is now 70, which no longer satisfies balance >= 80, so the UPDATE affects 0 rows and the withdrawal is refused, preventing a negative balance. This is fine when only one row is affected, but in a multi-row statement it could be inconsistent, operating on rows from different transactional states.

Conclusion

MongoDB supports both multi-document transactions and single-document atomic operations, but strongly encourages using single-document operations when possible. If you design your schema so that business logic fits in a single document, findOneAndUpdate() can perform conditional checks, apply updates, and return the updated document atomically, avoiding race condition anomalies and enabling idempotent retries.

In some cases—such as the one described in the earlier blog post on FOR UPDATE SKIP LOCKED in MongoDB—an updateOne() followed by a find() with appropriate conditions is enough and avoids storing before- or after-images.

Databases in 2025: A Year in Review

The world tried to kill Andy off, but he had to stay alive to talk about what happened with databases in 2025.

January 02, 2026

The Agentic Self: Parallels Between AI and Self-Improvement

2025 was the year of the agent. The goalposts for AGI shifted; we stopped asking AI to merely "talk" and demanded that it "act". As an outsider looking at the architecture of these new agents and agentic systems, I noticed something strange. The engineering tricks used to make AI smarter felt oddly familiar. They read less like computer science and more like … self-help advice.

The secret to agentic intelligence seems to lie in three very human habits: writing things down, talking to yourself, and pretending to be someone else. They are almost too simple.

The Unreasonable Effectiveness of Writing

One of the most profound pieces of advice I ever read as a PhD student came from Prof. Manuel Blum, a Turing Award winner. In his essay "Advice to a Beginning Graduate Student", he wrote: "Without writing, you are reduced to a finite automaton. With writing you have the extraordinary power of a Turing machine."

If you try to hold a complex argument entirely in your head, you will fail. Your working memory is a "finite automaton"; it has hard limits. But if you write things down, you offload that memory to the paper. You can then reference it, critique it, and build on it. The paper becomes your external hard drive.

AI Agents are now being built on this exact principle. An LLM has a limited context window, its effective attention span is limited. If it tries to solve a 50-step coding problem in one breath, it will crash and burn. To fix this, we give agents scratchpads. We force them to write down a plan before they execute code. We give them memory buffers to store facts for later. We are essentially asking AI to take notes. By externalizing their internal state onto a digital piece of paper, agents evolve from simple pattern-matchers into robust thinkers.

Thinking is Just Talking to Yourself in a Loop

For a long time, we treated LLMs as simple input-output machines: we asked a question, and it produced an answer. Often, the answers were unsatisfying—hallucinated or superficial. DeepSeek changed this by pausing the model before it answered, prompting it to “think” first. But what does thinking mean for a computer? It means generating text that remains hidden from the user: an internal monologue. In that sense, it mirrors human thought.

“Writing is nature’s way of letting you know how sloppy your thinking is.” — Guindon

“If you think without writing, you only think you’re thinking.” — Leslie Lamport

This brings us back to the unreasonable effectiveness of writing, but this time performed in a loop. Thinking isn’t instantaneous; it is a process. Prototyping for the win! The agent follows a loop: Act/Write → Reason → Repeat. It talks to itself, breaking complex problems into manageable parts. It asks, “Wait, should I check this assumption?” or “That doesn’t look right; let me try again.”

Role Playing: The Alter Ego Effect

A few years ago, I read The Alter Ego Effect: The Power of Secret Identities to Transform Your Life. Its core idea is that adopting a persona can unlock hidden capabilities. By stepping into a specific character, you bypass your own inhibitions and trigger a defined set of behaviors. Beyoncé had “Sasha Fierce” on stage, and most athletes have their “game faces”. The book made a convincing case, but this sounded so cheesy that I didn’t even blog about it at the time.

I guess that stuff actually works! In the world of AI Agents, they call this "Role Prompting" or "Multi-Agent Systems".

If you ask a single AI to "write code," it might do a mediocre job. But if you assign one AI the role of "The Architect" to plan the code, another the role of "The Engineer" to write it, and a third the role of "The Critic" to review it, the result is exponentially better.

These personas are not just theatrics; they are inductive biases. They constrain the search space. Just as adopting the persona of a "strict editor" helps a writer cut fluff, assigning an AI the role of a "debugger" forces it to look for errors rather than just generating text.

Advisor models act as meta-thinkers. They don’t execute tasks directly but monitor other agents, flag risks, and guide decisions. In coding, an Advisor might warn the Architect of a risky design or nudge the Engineer away from error-prone code. By providing this high-level oversight, Advisors let agents focus on immediate tasks while keeping long-term goals in view, making multi-agent systems more strategic.

Isn’t this just the classic advice-column cliche: "get a coach"?

Where do we go from here?

Maybe we are just discovering that the best way to utilize pattern-matching-based raw intelligence is through a simple universal workflow: Write it down, talk it through or try it out, and check your work. These steps may provide the minimal scaffolding required to sustain reasoning, forming the core mechanics of thought.

Lamport offers a corollary to Guindon:
Mathematics is nature's way of showing how sloppy our writing is.

Mathematics closes the loopholes that prose leaves open. What writing does to thought, mathematics does to writing. This is likely where symbolic AI and formal methods become important. Writing a specification already sharpens thinking. Formalizing it sharpens it again and makes reasoning as precise and reliable as possible.

Common prefix skipping, adaptive sort

The patent expired for US7680791B2. I invented this while at Oracle and it landed in 10gR2 with claims of ~5X better performance vs the previous sort algorithm used by Oracle. I hope for an open-source implementation one day. The patent has a good description of the algorithm, it is much easier to read than your typical patent. Thankfully the IP lawyer made good use of the functional and design docs that I wrote.

The patent is for a new in-memory sort algorithm that needs a name. Features include:

  • common prefix skipping
    • skips comparing the common prefix of key bytes when possible
  • adaptive
    • switches between quicksort and most-significant digit radix sort
  • key substring caching
    • reduces CPU cache misses by caching the next few bytes of the key
  • produces results before sort is done
    • sorted output can be produced (to the rest of the query, or spilled to disk for an external sort) before the sort is finished. 

How it came to be

From 2000 to 2005 I worked on query processing for Oracle. I am not sure why I started on this effort and it wasn't suggested by my bosses or peers. But the Sort Benchmark contest was active and I had more time to read technical papers. Perhaps I was inspired by the Alphasort paper.

While the Sort Benchmark advanced the state of the art in sort algorithms, it also encouraged algorithms that were great for benchmarks (focus on short keys with uniform distribution). But keys sorted by a DBMS are often much larger than 8 bytes and adjacent rows often have long common prefixes in their keys.

So I thought about this while falling asleep, and after many nights I realized that with a divide-and-conquer sort, as the algorithm descends into subpartitions of the data, the common prefixes of the keys in each subpartition are likely to grow:

  • if the algorithm remembers the length of the common prefix as it descends, it can skip that prefix during comparisons and save CPU
  • if the algorithm notices when the length of the common prefix grows, it can switch from quicksort to most-significant-digit (MSD) radix sort on the next byte beyond the common prefix, then switch back to quicksort afterward
  • the algorithm can cache bytes from the key in an array, like Alphasort. But unlike Alphasort, as it descends it can cache the next few bytes it will need to compare, rather than only the first few bytes of the key. This gives much better memory-system behavior (fewer cache misses).
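To make the prefix-skipping idea concrete, here is a purely illustrative Python sketch (my own, not the patented Oracle implementation): each subpartition tracks its known common prefix length, extends it when possible, and partitions on the next byte, MSD-radix style. The patented algorithm also switches between quicksort and radix sort and caches key substrings, which this sketch omits.

```python
# Illustrative only: common-prefix skipping with MSD radix partitioning.
def adaptive_sort(keys, prefix=0):
    if len(keys) <= 1:
        return list(keys)
    # Extend the known common prefix of this subpartition; comparisons
    # below never look at bytes before `prefix` again.
    while all(len(k) > prefix for k in keys) and len({k[prefix] for k in keys}) == 1:
        prefix += 1
    # MSD radix step: bucket by the first byte beyond the common prefix.
    buckets = {}
    shorter = []  # keys that end at the common prefix sort first
    for k in keys:
        if len(k) <= prefix:
            shorter.append(k)
        else:
            buckets.setdefault(k[prefix], []).append(k)
    out = shorter
    for b in sorted(buckets):
        # Each bucket's common prefix is at least prefix + 1 bytes long.
        out += adaptive_sort(buckets[b], prefix + 1)
    return out
```

Sorted output for each bucket is emitted as soon as that bucket is finished, which hints at how results can flow to the rest of the query before the whole sort completes.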

Early implementation

This might have been in 2003 before we were able to access work computers from home. I needed to get results that would convince management this was worth doing. I started my proof-of-concept on an old PowerPC based Mac I had at home that found a second life after I installed Yellow Dog Linux on it.

After some iteration I had good results on the PowerPC. So I brought my source code into work and repeated the test on other CPUs that I could find. On my desk I had a Sun workstation and a Windows PC with a 6 year old Pentium 3 CPU (600MHz, 128kb L2 cache). Elsewhere I had access to a new Sun server with a 900MHz UltraSPARC IV (or IV+) CPU and an HP server with a PA RISC CPU.

I also implemented other state of the art algorithms including Alphasort along with the old sort algorithm used by Oracle. From testing I learned:
  1. my new sort was much faster than other algorithms when keys were larger than 8 bytes
  2. my new sort was faster on my old Pentium 3 CPU than on the Sun UltraSPARC IV

The first was great news for me; the second was less than great news for Sun shareholders. I never learned why that UltraSPARC IV performance was lousy. It might have been latency to the caches.

Real implementation

Once I had great results, it was time for the functional and design specification reviews. I remember two issues:
  • the old sort was stable, the new sort was not
    • I don't remember how this concern was addressed
  • the new sort has a bad, but unlikely, worst-case
    • The problem here is the worst-case when quicksort picks the worst pivot every time it selects a pivot. The new sort wasn't naive, it used the median from a sample of keys each time to select a pivot (the sample size might have been 5). So I did the math to estimate the risk. Given that the numbers are big and probabilities are small I needed a library or tool that supported arbitrary-precision arithmetic and ended up using a Scheme implementation. The speedup in most cases justified the risk in a few cases.

And once I had this implemented within the Oracle DBMS I was able to compare it with the old sort. The new sort was often about 5 times faster than the old sort. I then compared it with SyncSort. I don't remember whether they had a DeWitt Clause so I won't share the results but I will say that the new sort in Oracle looked great in comparison.

The End

The new sort landed in 10gR2 and was featured in a white paper. I also got a short email from Larry Ellison thanking me for the work. A promotion or bonus would have to wait, as you had to play the long game in your career at Oracle. And that was all the motivation I needed to leave Oracle -- first for a startup, and then to Google and Facebook.

After leaving Oracle, much of my time was spent on making MySQL better. Great open-source DBMS, like MySQL and PostgreSQL, were not good for Oracle's new license revenue. Oracle is a better DBMS, but not everyone needs it or can afford it.

No, You Can't (Officially) Reconnect Cut Pieces of Hue Strip Lights

Hopefully this saves someone else an hour of digging. Philips Hue has a comparison page for their strip lights. This table says that for the Ambiance Gradient lightstrips, “Cut pieces can be reconnected”. Their “Can you cut LED strip lights” page also says “Many strip lights are cuttable, and some even allow the cut parts to be reused and added to the strip light using a connector. Lightstrip V4 and many of the latest models will enable this level of customization.”

Lightstrip V4 has been out for six years, but at least as far as Hue’s official product line goes, the statement “Cut pieces can be reconnected” is not true. Hue’s support representative told me that Hue might someday release a connector which allows reconnecting cut pieces. However, that product does not exist yet, and they can’t say when it might be released.

In the meantime, there’s a company called Litcessory which makes third-party adapters for Hue strips. I haven’t tried them, but I think they might do the trick.

Running Databases on Kubernetes: A Practical Guide to Risks, Benefits, and Best Practices

As a database administrator, you are the guardian of the company’s most critical asset: its data. You live by performance, reliability, and security, ensuring every change maintains uptime and data integrity. That level of precision takes time, as every update, patch, and configuration is tested before it goes live. Meanwhile, application teams have fully embraced […]

January 01, 2026

MongoDB equivalent to FOR UPDATE SKIP LOCKED

SELECT ... FOR UPDATE SKIP LOCKED is a vendor-specific SQL feature available in several relational databases (e.g., PostgreSQL, Oracle, MySQL). It helps parallel workers avoid waiting on locked rows. MongoDB’s concurrency model uses optimistic concurrency: reads don’t block writes, and writes don’t block reads. To coordinate parallel processing, you can reserve a document by writing a lock field so other workers skip it.

I'll use an example discussed in the Reddit question "ACID read then write – Python":

Client in python, multi process. Each process picks and reads one document, calls some public APIs, and add data to the document and save it. Then next document. What is written can depend on the read data. Question is: in python, how can I create and configure transactions in the code to make sure no other process can read or write its current document from the moment a process starts reading it until done writing its additional data? This means concurrent reads should not happen…

In this example, I'll process messages based on their originating IP address. Multiple threads will enrich them with location data fetched from the public API at https://ip-api.com/.

Here is an example of an initial document:

{
  _id: ObjectId('6956e772baea71e37a818e73'),
  originatingIp: '1.1.1.1',
  location: null
}

Here is the document while it is being processed:

{
  _id: ObjectId('6956e772baea71e37a818e73'),
  originatingIp: '1.1.1.1',
  location: null,
  lock: {
    by: 'franck',
    until: datetime.datetime(2026, 1, 1, 22, 33, 10, 833000)
  }
}

Here is the same document after processing:

{
  _id: ObjectId('6956e772baea71e37a818e73'),
  originatingIp: '1.1.1.1',
  location: {
    status: 'success',
    country: 'Hong Kong',
    countryCode: 'HK',
    region: 'HCW',
    regionName: 'Central and Western District',
    city: 'Hong Kong',
    zip: '',
    lat: 22.3193,
    lon: 114.1693,
    timezone: 'Asia/Hong_Kong',
    isp: 'Cloudflare, Inc',
    org: 'APNIC and Cloudflare DNS Resolver project',
    as: 'AS13335 Cloudflare, Inc.',
    query: '1.1.1.1'
  }
}

Storing in-process information avoids long transactions that hide the current status and make troubleshooting difficult when the public API is slow.

Design

This script is designed as a complete, runnable demonstration of how to implement SELECT ... FOR UPDATE SKIP LOCKED-style parallel job claiming in MongoDB. The script will generate everything it needs, process it, and show the end state.

  • insert_test_docs() inserts test data with random IP addresses in a new collection "message", and creates a partial index to get the message to process ({location: null}).
  • claim_document() updates a message to process, adding lock information so that another thread will not pick the same one, and fetches the document. The criteria are that the message still needs processing ({location: null}) and is not locked, or its lock has expired (with a 1s grace period to account for clock skew).
  • fetch_location() is the call to the public API, here getting location information for an IP address.
  • process_document() calls claim_document() to get a message to process, with a lock. It calls fetch_location() and updates the document with the location. It ensures the lock is still in place before the update, then unsets it. Each thread runs in a loop, claiming and processing documents until the timeout.
  • main() calls those in sequence and displays the final documents.

This solution avoids explicit transactions, which is preferable because they would include a call to a public API with unpredictable response time. It also avoids using findOneAndUpdate, whose higher overhead comes from storing full pre- and post-images of documents for retryable operations. For large documents—possible in real workloads, even if not shown in this demo—this would lead to significant write amplification. Finally, setting an expiration timestamp allows automatic re-processing if a message fails.

Code

Below is the complete Python program, which you can test using different numbers of documents and threads:

import os
import random
import socket
import threading
import time
from datetime import datetime, timedelta
import requests
from pymongo import MongoClient


# Mongo connection and collection
client = MongoClient("mongodb://127.0.0.1:27017/?directConnection=true")
db = client.test
messages = db.message

#  Test settings (the test inserts documents, then runs the processing threads for some duration)
DOCUMENTS = 10 # number of documents created initially
THREADS = 5    # number of threads that loop to claim a document
SECONDS = 15   # seconds after which a thread stops trying to claim documents

# Worker identity (to identify the thread, and set an expiration on the lock)
WORKER_ID = f"{socket.gethostname()}-{os.getpid()}"
LOCK_DURATION = timedelta(seconds=60) # assumes processing completes within that duration, if not, it will be claimed by another, and this one will not update it

# Get the time
def utcnow(): return datetime.utcnow()
MAX_CLOCK_SKEW=timedelta(seconds=1) # used as a grace period when lock is expired

# --- Prepare test messages (with random generated IP) ---
def insert_test_docs():
    # Drop the collection completely (removes data + indexes)
    messages.drop()
    # Create the partial index for unprocessed docs  (they have location = null )
    messages.create_index(  [("lock.until", 1)],  partialFilterExpression={"location": None}  )
    # Generate random IPs for the test
    ips = [
        ".".join(str(random.randint(1, 255)) for _ in range(4))
        for _ in range(DOCUMENTS)
    ]
    # Explicitly set location=None to match the partial index filter
    docs = [
        { "originatingIp": ip, "location": None  } # A null location is the marker to process it
        for ip in ips
    ]
    messages.insert_many(docs)
    print(f"[STARTUP] Inserted {DOCUMENTS} test docs into 'message'")
    for doc in messages.find({}, {"_id": 0, "originatingIp": 1, "location": 1}):
        print(doc)


# --- Claim a message ---
def claim_document():
    now = utcnow()
    lock_expiry = now + LOCK_DURATION
    token = random.randint(1000, 9999)  # unique lock token for extra safety
    # Atomic lock claim: match an unlocked doc, or steal one whose lock has expired
    result = messages.update_one(
        {
          "$and": [
            # the location is not set
            { "location": None },
            # the document is not locked, or its lock expired (beyond the grace period)
            {  "$or": [  { "lock": { "$exists": False } },  { "lock.until": { "$lt": now - MAX_CLOCK_SKEW } }  ]  }
          ]
        },
        { "$set": {  "lock": {  "by": WORKER_ID,  "until": lock_expiry,  "token": token  }  }}
    )
    if result.modified_count == 0:
        return None
    # Fetch exactly the doc we locked — match by worker, expiry, AND token
    doc = messages.find_one({  "lock.by": WORKER_ID,  "lock.until": lock_expiry,  "lock.token": token  })
    if doc:
        print(f"[{WORKER_ID}] {threading.current_thread().name} claimed IP {doc['originatingIp']} with token={token}")
    else:
        print(f"[{WORKER_ID}] {threading.current_thread().name} claim succeeded but fetch failed — possible race?")
    return doc

# --- Call the public API ---
def fetch_location(ip):
    url = f"http://ip-api.com/json/{ip}"
    try:
        resp = requests.get(url, timeout=30)
        if resp.status_code == 200:
            return resp.json()
        print(f"[API] Error: HTTP {resp.status_code} for {ip}")
        return None
    except Exception as e:
        print(f"[API] Exception for {ip}: {e}")
        return None

# --- Process messages in a loop ---
def process_document():
    start_time = time.time()
    timeout = SECONDS  # seconds
    thread_name = threading.current_thread().name
    while True:
        # Try to claim a doc
        doc = claim_document()
        if doc:
            # We successfully claimed a doc — process it
            ip = doc["originatingIp"]
            location_data = fetch_location(ip)
            if not location_data:
                print(f"[{WORKER_ID}] {thread_name} failed to fetch location for {ip}")
                return  # keep the lock; it will expire and another worker can re-claim the doc
            # Final update only if lock is still valid
            now = utcnow()
            result = messages.update_one(
                {
                    "_id": doc["_id"],
                    "lock.by": WORKER_ID,
                    "lock.until": {"$gte": now},
                    "lock.token": doc["lock"]["token"]
                },
                {
                    "$set": {"location": location_data},
                    "$unset": {"lock": ""}
                }
            )
        # Check elapsed time, then wait before the next claim attempt
        elapsed = time.time() - start_time
        if elapsed >= timeout:
            print(f"[{WORKER_ID}] {thread_name} exiting after {elapsed:.2f}s")
            return
        time.sleep(5)  # avoid hammering DB and the public API

# --- Initialize and run multiple processing threads ---
def main():
    print(f"\nInserting documents")
    insert_test_docs()
    print(f"\nStarting threads")
    threads = []
    for i in range(THREADS):
        tname = f"T{i}"
        t = threading.Thread(target=process_document, name=tname)
        t.start()
        threads.append(t)
    for t in threads:
        t.join()
    print(f"\n[{WORKER_ID}] Check final documents:")
    for doc in messages.find({}, {"originatingIp": 1, "location.query": 1, "location.country": 1, "location.message": 1, "lock.by": 1, "lock.until": 1}):
        print(doc)

if __name__ == "__main__":
    main()

Technical Insights

MongoDB’s storage engine guarantees atomicity for each update_one through its WriteUnitOfWork and RecoveryUnit mechanisms. However, maintaining read consistency across multiple operations requires application-level coordination. In this implementation, that coordination is provided by an atomic claim with conditional criteria, ensuring that only one worker can lock an unprocessed or expired document at a time.

Several safeguards mitigate race conditions. The claim step narrows matches using the worker ID, lock expiry, and a random token. The final update then re-verifies all these fields before committing changes, preventing stale or stolen locks from being applied. Lock expiration enables automatic recovery from failures, and a small grace window accounts for clock skew in distributed systems.

Write conflicts during concurrent updates are automatically resolved at the storage layer via optimistic concurrency control. This ensures correctness without blocking other operations. The result is a robust, non-blocking parallel processing workflow that preserves document-level ACID guarantees while scaling effectively in shared or cloud environments.

In this design, each thread processes one message at a time, in index order. Enforcing strict global message ordering would be more complex. The primary goal here is the scalability of the parallel processing.

Final Recommendation

When migrating from PostgreSQL to MongoDB—like between any two databases—avoid a direct feature-by-feature mapping, because the systems are fundamentally different. SKip LOCKED works around blocking FOR UPDATE reads in PostgreSQL, whereas reads and writes do not block each other in MongoDB. Instead of replicating another database's behavior, clarify the business requirement and design the most appropriate solution. In this example, rather than relying on generic transaction control as in SQL, we modeled object states—such as claim acquisition and expiration—and stored that state directly in the documents.

December 31, 2025

CVE-2025-14847 (MongoBleed) — A High-Severity Memory Leak in MongoDB

A high severity vulnerability, referred to as “mongobleed” (CVE-2025-14847) has been identified in most versions of MongoDB Community and Enterprise editions and MongoDB Atlas. Percona Server for MongoDB (PSMDB) is also affected, since it is based on the upstream MongoDB Community code base. This issue affects all MongoDB server binaries where zlib network compression is […]

Building a Multi-Cloud Strategy: Cut Costs, Improve Resilience, and Avoid Lock-In

When you went all-in on the cloud, you were promised agility and savings. But sometimes, the reality feels very different. Instead of simplicity and flexibility, you’re facing higher bills, shrinking options, and a single vendor with all the leverage. You’ve just swapped one form of vendor lock-in for another, and this lack of control can […]

Why isn't "majority" the default read concern in MongoDB?

TL;DR: If you’re used to traditional SQL databases and synchronous request–response flows—where you read your writes in the same transaction or session—use the "majority" read concern in MongoDB and you will have the highest isolation and durability you can expect from a database. It’s not the default, but it’s safe to change it for your connection. The default is optimized for event-driven, microservice architectures with asynchronous communication, where lower latency is preferred even if it means sometimes reading a state that may later be rolled back.

PostgreSQL users typically expect writes to become visible to other sessions only after they’re acknowledged, either via auto-commit DML or an explicit COMMIT. By contrast, in MongoDB, you must enable the "majority" read concern to achieve similar ACID guarantees, and this is not the default. It may seem surprising that MongoDB offers the strongest consistency option—full ACID semantics in a distributed database—yet doesn’t enable it by default, even though enabling it appears to carry no significant performance cost. This caught my attention and made me want to understand the reasoning behind it. NoSQL and SQL now address similar use cases, but their origins are fundamentally different. Let’s explore that.

Non-blocking read and write concerns

In the SQL standard, isolation levels were first defined by the anomalies (phenomena) that can occur when concurrent sessions read and write the same data. But these definitions were tied to a specific lock-based implementation rather than an abstract model: they assumed that reads and writes use locks and that active transactions share and modify a single current database state.

In reality, many databases chose different designs for scalability:

  • Non-blocking reads with MVCC (e.g., PostgreSQL or MongoDB) show anomalies not covered by the standard—"write skew," for instance—and support isolation levels like Snapshot Isolation (SI), which differs from the SQL definitions, even though PostgreSQL uses the name Repeatable Read to match the SQL standard.
  • Non-blocking writes (e.g., in MongoDB) detect write conflicts immediately and raise a retryable exception instead of waiting for lock acquisition, also known as optimistic concurrency control.

To understand isolation and durability in MongoDB, we must first consider read and write concerns independently, especially in a replicated, distributed setup where reads and writes can hit different servers. Then we can examine how they interact when we read after writing.

Isolation and durability

First, let’s distinguish isolation and durability — the I and D in ACID:

  • Isolation defines how reads and writes from different sessions are visible to one another. To preserve atomicity, it must hide intermediate states of uncommitted writes until the transaction completes and should also prevent stale reads that miss previously committed writes.
  • Durability ensures that once data is written, it remains persistent and is not lost after a failure. Similarly, to prevent dirty reads that might later be rolled back during failure recovery, data that has already been read should also be guaranteed to remain persistent.

Initially, these definitions assumed a single-node database. In modern systems, durability must also handle network and data center failures, so data is persisted across multiple nodes rather than just on a local disk.

A commit, whether in an explicit transaction or implicit in a write operation, typically proceeds as follows:

  1. Commit is initiated.
  2. The write-ahead log is flushed to local disk (local durability).
  3. The write-ahead log is flushed to the remote disk (global durability).
  4. Changes become visible (end of isolation) to other sessions.
  5. The commit is acknowledged in the session.

Durability and isolation each involve multiple operations, and their order can vary. The sequence above matches PostgreSQL with synchronous_commit = on, or MongoDB with w:majority and a majority read concern in other sessions.

Other configurations are possible. For example, Oracle Database uses a different order for durability and isolation, making changes visible before the redo log is flushed (except when paranoid_concurrency_mode is set). With PostgreSQL synchronous_commit = local or MongoDB w:1, acknowledgment occurs before global durability. With MongoDB’s local read concern, data becomes visible before it is durable.
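These orderings can be compared directly with a purely illustrative model (the step names and the two orderings below are this sketch's own simplification, not engine internals):

```python
# Illustrative model only: each list is the order in which commit steps
# complete, per the numbered sequence above.
MAJORITY_LIKE = ["initiate", "local_durable", "global_durable", "visible", "ack"]
LOCAL_LIKE    = ["initiate", "local_durable", "visible", "ack", "global_durable"]

def readers_may_see_nondurable(order):
    # True if changes become visible to other sessions before they are
    # globally durable, i.e. a concurrent reader could observe data that
    # might still be rolled back after a failover.
    return order.index("visible") < order.index("global_durable")
```

With the majority-like ordering a reader never sees data that has not reached a majority; with the local-like ordering it can, which is exactly the trade-off between latency and the risk of reading data that may be rolled back.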

Why isn’t the above sequence—which seems to offer the strongest isolation and durability—the default in MongoDB?

Read after a write with asynchronous calls

There is another anomaly not described by the SQL standard, which assumes that read and write locks on a single database state are mutually exclusive. With MVCC, a transaction instead works with two states:

  • Read time is the start of the transaction (or the start of the statement in Read Committed transactions). All reads use a snapshot from this time.
  • Write time is the end of the transaction, since all writes must appear to occur atomically at commit.

Because the read time is earlier than the write time, another anomaly can occur:

  1. Microservice A writes an event, assumes it will be persisted and visible, and notifies microservice B.
  2. Microservice B receives the notification and reads the event, assuming it is visible.
  3. Microservice A receives the write acknowledgment a few milliseconds later, especially if global durability must be confirmed.

In a non-MVCC database with blocking reads, this preserves causality: in step 2, microservice B must acquire a share lock, so it waits on the exclusive lock that A acquired and releases at step 3, and B sees the write only after that. Non-MVCC databases are rare (e.g., DB2, or SQL Server without RCSI), but the SQL isolation levels were defined based on them, and they did not mention causality.

Keep in mind that in this example, the application doesn’t wait for the write acknowledgment before telling the other service to read, yet it still expects the write to be complete when the read occurs. Read-after-write causality was guaranteed with read locks in the non-MVCC database.

However, in an MVCC database, as in most modern systems, microservice B may read a state from before a write is visible, causing a read-after-write anomaly. If the write is acknowledged only locally—for example, PostgreSQL with synchronous_commit = local or MongoDB with w:1—it will likely be visible by the time B receives the notification, because the write usually completes faster than the notification is delivered.

By contrast, PostgreSQL with synchronous_commit = on, or MongoDB with majority read concern, may not see the write yet if it has not been replicated to a majority. Thus, when using w:1, users should select the local read concern to avoid read-after-write anomalies. w:1 is not the default. Still, it can be chosen to reduce latency, at the risk of losing events on failure—something event-driven architectures can often tolerate.

With PostgreSQL synchronous_commit = on or MongoDB w:majority (the default), writes incur extra network latency because they must wait for remote acknowledgment. In this case, the scenario can still show a read-after-write anomaly if the majority has not yet acknowledged microservice A's write when microservice B reads. Using MongoDB local read concern avoids this anomaly, but risks reading data that might later be rolled back on failure.

"local" is the default, but use "majority"

The default read concern is well-suited to event-driven architectures. As event-driven systems were a primary use case for NoSQL databases like MongoDB, retaining this default makes sense, at least for backward compatibility. Users also often expect reads to return the latest changes, even if those changes have not yet been acknowledged in the thread that performed the write operation.

Today, MongoDB is also used with traditional architectures, where it’s reasonable to prefer durability over fast visibility and use the "majority" read concern. This adds no performance penalty, because you already paid the synchronization latency when waiting for the write acknowledgment. "Majority" read concern sets the read time to the last commit time, while keeping reads local. It can wait in rare cases, such as during instance startup or rollback, until it can obtain a committed timestamp snapshot, or when secondaries are unavailable or lagging. But generally, there's no performance impact.

Unlike SQL databases—which must guarantee consistency for any DML executed by any user, including non-programmers at the command line—MongoDB shifts more responsibility to developers. Instead of relying on a one-size-fits-all default, developers must configure their session or connection by choosing:

  • the write concern (for example, w:majority for durability over network or data center failures),
  • the read concern (such as majority, or snapshot for stronger consistency in multi-shard transactions), and
  • the read preference (to scale reads across replicas when some staleness is acceptable).

This configuration lets MongoDB adapt to different consistency and performance expectations.
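
As a sketch in mongosh syntax, these three knobs can be set per session or per operation. This is illustrative only: the collection, documents, and replica-set deployment are assumptions, not from the article, and running it requires a live replica set.

```javascript
// Hedged sketch (mongosh): choosing durability and visibility explicitly.
// Assumes a replica set; the "events" collection and documents are invented.

// Write concern: wait for a majority of members to acknowledge the write.
db.events.insertOne(
  { _id: "E0001", type: "order_created", orderId: "O0001" },
  { writeConcern: { w: "majority" } }
)

// Read concern: only return majority-committed data, so the read cannot
// observe a write that is later rolled back after a failover.
db.events.find({ orderId: "O0001" }).readConcern("majority")

// Read preference: allow reads from secondaries when staleness is acceptable.
db.events.find({ orderId: "O0001" }).readPref("secondaryPreferred")
```

As noted above, once the write has already paid for majority acknowledgment, reading with the majority read concern adds no extra latency in the common case.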

ClickHouse® Kafka Engine vs Tinybird Kafka Connector

Compare ClickHouse OSS Kafka Engine and Tinybird's Kafka connector. Understand tradeoffs, failure modes and when to choose each solution for your Kafka to ClickHouse pipeline.

December 30, 2025

Performance for RocksDB 9.8 through 10.10 on 8-core and 48-core servers

This post has results for RocksDB performance using db_bench on 8-core and 48-core servers. I previously shared results for RocksDB performance using gcc and clang and then for RocksDB on a small Arm server

tl;dr

  • RocksDB is boring; there are few performance regressions.
  • There was a regression in write-heavy workloads with RocksDB 10.6.2. See bug 13996 for details. That has been fixed.
  • I will repeat these tests in a few weeks.

Software

I used RocksDB versions 9.8 through 10.10.

I compiled each version with clang version 18.3.1 and link-time optimization (LTO) enabled. The build command line was:

flags=( DISABLE_WARNING_AS_ERROR=1 DEBUG_LEVEL=0 V=1 VERBOSE=1 )

# for clang+LTO
AR=llvm-ar-18 RANLIB=llvm-ranlib-18 CC=clang CXX=clang++ \
    make "${flags[@]}" static_lib db_bench

Hardware

I used servers with 8 and 48 cores, both running Ubuntu 22.04:

  • 8-core
    • Ryzen 7 (AMD) CPU with 8 cores and 32G of RAM.
    • storage is one NVMe SSD with discard enabled and ext4
    • benchmarks are run with 1 client, 20M KV pairs for byrx and 400M KV pairs for iobuf and iodir
  • 48-core
    • an ax162s from Hetzner with an AMD EPYC 9454P 48-Core Processor with SMT disabled, 128G of RAM
    • storage is 2 SSDs with RAID 1 (3.8T each) and ext4.
    • benchmarks are run with 36 clients, 200M KV pairs for byrx and 2B KV pairs for iobuf and iodir

Benchmark

Overviews on how I use db_bench are here and here.

Most benchmark steps were run for 1800 seconds and all used the LRU block cache. I try to use Hyperclock on large servers but forgot to this time.

Tests were run for three workloads:

  • byrx - database cached by RocksDB
  • iobuf - database is larger than RAM and RocksDB used buffered IO
  • iodir - database is larger than RAM and RocksDB used O_DIRECT

The benchmark steps that I focus on are:
  • fillseq
    • load RocksDB in key order with 1 thread
  • revrangeww, fwdrangeww
    • do reverse or forward range queries with a rate-limited writer. Report performance for the range queries
  • readww
    • do point queries with a rate-limited writer. Report performance for the point queries.
  • overwrite
    • overwrite (via Put) random keys and wait for compaction to stop at test end

Relative QPS

Many of the tables below (inlined and via URL) show the relative QPS which is:
    (QPS for my version / QPS for RocksDB 9.8)

The base version varies and is listed below. When the relative QPS is > 1.0, that version is faster than RocksDB 9.8. When it is < 1.0, there might be a performance regression, or there might just be noise.
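
As a trivial sketch of that arithmetic (the QPS numbers below are invented for illustration):

```javascript
// Relative QPS = (QPS for a given version) / (QPS for the base version).
// Here the base is RocksDB 9.8; all numbers are made up.
function relativeQps(qps, baseQps) {
  return qps / baseQps;
}

const base = 100000; // pretend QPS for RocksDB 9.8
console.log(relativeQps(103000, base)); // 1.03: faster than the base
console.log(relativeQps(97000, base));  // 0.97: possible regression, or noise
```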

The spreadsheet with numbers and charts is here. Performance summaries are here.

Results: cached database (byrx)

From 1 client on the 8-core server

  • Results are stable except for the overwrite test, where there might be a regression. But after repeating this test 2 more times, I think that is noise: the cause appears to be that the base case (the result from 9.8) was an outlier. I will revisit this.

From 36 clients on the 48-core server

  • Results are stable

Results: IO-bound with buffered IO (iobuf)

From 1 client on the 8-core server

  • Results are stable except for the overwrite test where there might be a large improvement. But I wonder if this is from noise in the result for the base case from RocksDB 9.8, just as there might be noise in the cached (byrx) results.
  • The regression in fillseq with 10.6.2 is from bug 13996

From 36 clients on the 48-core server

  • Results are stable except for the overwrite test where there might be a large improvement. But I wonder if this is from noise in the result for the base case from RocksDB 9.8, just as there might be noise in the cached (byrx) results.
  • The regression in fillseq with 10.6.2 is from bug 13996

Results: IO-bound with O_DIRECT (iodir)

From 1 client on the 8-core server

  • Results are stable
  • The regression in fillseq with 10.6.2 is from bug 13996

From 36 clients on the 48-core server

  • Results are stable
  • The regression in fillseq with 10.6.2 is from bug 13996


Migrate to Freedom: Choosing a Truly Open Source PostgreSQL Operator

Open Source Isn’t What It Used to Be. The landscape of open source has undergone significant changes in recent years, and selecting the right operator and tooling for PostgreSQL clusters in Kubernetes has never been more crucial. MinIO, for example, was a widely used open source S3-compatible storage backend. Over the past few years, it has: […]

IO-bound sysbench vs Postgres on a 48-core server

This has results for an IO-bound sysbench benchmark on a 48-core server for Postgres versions 12 through 18. Results from a CPU-bound sysbench benchmark on the 48-core server are here.

tl;dr - for Postgres 18.1 relative to 12.22

  • QPS for IO-bound point-query tests is similar while there is a large improvement for the one CPU-bound test (hot-points)
  • QPS for range queries without aggregation is similar
  • QPS for range queries with aggregation is between 1.05X and 1.25X larger in 18.1
  • QPS for writes shows there might be a few large regressions in 18.1

tl;dr - for Postgres 18.1 using different values for the io_method option
  • for tests that do long range queries without aggregation
    • the best QPS is from io_method=io_uring
    • the second best QPS is from io_method=worker with a large value for io_workers
  • for tests that do long range queries with aggregation
    • when using io_method=worker a larger value for io_workers hurt QPS in contrast to the result for range queries without aggregation
    • for most tests the best QPS is from io_method=io_uring
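
For reference, these variants correspond to plain settings in postgresql.conf on Postgres 18. A sketch; the io_workers value here is an example, not a recommendation:

```
# Postgres 18 asynchronous IO configuration (illustrative values)
io_method = worker     # alternatives: sync, io_uring
io_workers = 16        # only used with io_method = worker; the default is 3
```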

Builds, configuration and hardware

I compiled Postgres from source for versions 12.22, 13.23, 14.20, 15.15, 16.10, 16.11, 17.6, 17.7, 18.0 and 18.1.

I used a 48-core server from Hetzner
  • an ax162s with an AMD EPYC 9454P 48-Core Processor with SMT disabled
  • 2 Intel D7-P5520 NVMe storage devices with RAID 1 (3.8T each) using ext4
  • 128G RAM
  • Ubuntu 22.04 running the non-HWE kernel (5.15.0-118-generic)

Configuration files for the big server

Benchmark

I used sysbench and my usage is explained here. I now run 32 of the 42 microbenchmarks listed in that blog post. Most test only one type of SQL statement.

The read-heavy microbenchmarks are run for 600 seconds and the write-heavy for 900 seconds. The benchmark is run with 40 clients and 8 tables with 250M rows per table. With 250M rows per table this is IO-bound. I normally use 10M rows per table for CPU-bound workloads.

The purpose is to search for regressions from new CPU overhead and mutex contention. I use the small server with low concurrency to find regressions from new CPU overheads and then larger servers with high concurrency to find regressions from new CPU overheads and mutex contention.

Results

The microbenchmarks are split into 4 groups -- 1 for point queries, 2 for range queries, 1 for writes. For the range query microbenchmarks, part 1 has queries without aggregation while part 2 has queries with aggregation. 

I provide charts below with relative QPS. The relative QPS is the following:
(QPS for some version) / (QPS for base version)
When the relative QPS is > 1 then some version is faster than the base version. When it is < 1 then there might be a regression. When the relative QPS is 1.2 then some version is about 20% faster than the base version.

I provide two comparisons and each uses a different base version. They are:
  • base version is Postgres 12.22
    • compare 12.22, 13.23, 14.20, 15.15, 16.11, 17.7 and 18.1
    • the goal for this is to see how performance changes over time
    • per-test results from vmstat and iostat are here
  • base version is Postgres 18.1
    • compare 18.1 using the x10b_c32r128, x10c_c32r128, x10cw8_c32r128, x10cw16_c32r128, x10cw32_c32r128 and x10d_c32r128 configs
    • the goal for this is to understand the impact of the io_method option
    • per-test results from vmstat and iostat are here
The per-test results from vmstat and iostat can help explain why something is faster or slower because they show how much HW is used per request, including CPU overhead per operation (cpu/o) and context switches per operation (cs/o), which are often a proxy for mutex contention.

The spreadsheet and charts are here and in some cases are easier to read than the charts below. Converting the Google Sheets charts to PNG files does the wrong thing for some of the test names listed at the bottom of the charts below.

Results: Postgres 12.22 through 18.1

All charts except the first have the y-axis start at 0.7 rather than 0.0 to improve readability.

There are two charts for point queries. The second truncates the y-axis to improve readability.
  • a large improvement for the hot-points test arrives in 17.x. While most tests are IO-bound, this test is CPU-bound because all queries fetch the same N rows.
  • for other tests there are small changes, both improvements and regressions, and the regressions are too small to investigate
For range queries without aggregation:
  • QPS for Postgres 18.1 is within 5% of 12.22, sometimes better and sometimes worse
  • for Postgres 17.7 there might be a large regression on the scan test, and it also occurs with 17.6 (not shown). But the scan test can be prone to variance, especially with Postgres, and I don't expect to spend time debugging this. Note that the config I use for 18.1 here uses io_method=sync, which is similar to what Postgres used in releases prior to 18.x. From the vmstat and iostat metrics, what I see is:
    • a small reduction in CPU overhead (cpu/o) in 18.1
    • a large reduction in the context switch rate (cs/o) in 18.1
    • small reductions in read IO (r/o and rKB/o) in 18.1
For range queries with aggregation:
  • QPS for 18.1 is between 1.05X and 1.25X better than for 12.22
For write-heavy tests
  • there might be large regressions for several tests: read-write, update-zipf and write-only. The read-write tests do all of the writes done by write-only and then add read-only statements.
  • from the vmstat and iostat results for the read-write tests I see
    • CPU (cpu/o) is up by ~1.2X in PG 16.x through 18.x
    • storage reads per query (r/o) have been increasing from PG 16.x through 18.x and are up by ~1.1X in PG 18.1
    • the increase in storage KB read per query (rKB/o) started in PG 16.1, and rKB/o is 1.44X and 1.16X larger in PG 18.x
  • from the vmstat and iostat results for the update-zipf test
    • results are similar to the read-write tests above
  • from the vmstat and iostat results for the write-only test
    • results are similar to the read-write tests above
Results: Postgres 18.1 and io_method

For point queries
  • results are similar for all configurations and this is expected
For range queries without aggregation
  • there are two charts, the y-axis is truncated in the second to improve readability
  • all configs get similar QPS for all tests except scan
  • for the scan test
    • the x10c_c32r128 config has the worst result. This is expected given there are 40 concurrent connections and it used the default for io_workers (=3)
    • QPS improves for io_method=worker with larger values for io_workers
    • io_method=io_uring has the best QPS (the x10d_c32r128 config)
For range queries with aggregation
  • when using io_method=worker a larger value for io_workers hurt QPS in contrast to the result for range queries without aggregation
  • io_method=io_uring gets the best QPS on all tests except for the read-only tests with range=10 and 10,000. There isn't an obvious problem based on the vmstat and iostat results. From the r_await column in iostat output (not shown) the differences are mostly explained by a change in IO latency. Perhaps variance in storage latency is the issue.
For writes
  • the best QPS occurs with the x10b_c32r128 config (io_method=sync). I am not sure if that option matters here and perhaps there is too much noise in the results.

How to Fix Kafka to ClickHouse® Performance Bottlenecks

Learn how to optimize your Kafka to ClickHouse pipeline performance with schema optimization, Materialized View tuning, partition distribution strategies and throughput best practices.

December 29, 2025

Rethinking the Cost of Distributed Caches for Datacenter Services

This paper (HOTNETS'25) re-teaches a familiar systems lesson: caching is not just about reducing latency, it is also about saving CPU! The paper makes this point concrete by focusing on the second-order effect that often dominates in practice: the monetary cost of computation. The paper shows that caching --even after accounting for the cost of DRAM you use for caching-- still yields 3–4x better cost efficiency thanks to the reduction in CPU usage. In today's cloud pricing model, that CPU cost dominates. DRAM is cheap. Well, was cheap... I guess the joke is on them now, since right after this paper got presented, the DRAM prices jumped by 3-4x! Damn Machine Learning ruining everything since 2018!

Anyways, let's conveniently ignore that point and get back to the paper. Ok, so caches do help, but when do they help the most? Many database-centric or storage-side cache designs miss this point. Even when data is cached at the storage/database layer, an application read still needs to travel there and pay for RPCs, query planning, serialization, and coordination checks.

The paper advocates for moving the caches as close to the application as possible to cut costs for CPU. The key argument is that application-level linked caches deliver far better cost savings than storage-layer caches. By caching fully materialized application objects and bypassing the storage/database read path entirely, linked caches eliminate query amplification and coordination overhead. Across production workloads, this yields 3–4x better cost efficiency than storage-layer caching, easily offsetting the additional DRAM cost. Remote caches help, but still burn CPU on RPCs and serialization. Storage-layer caches save disk I/O but leave most of the query and coordination path intact, delivering the weakest cost savings. The results are consistent across different access skews and read intensities, reinforcing that cache placement dominates cache size.
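
The linked-cache idea is easy to sketch: keep fully materialized application objects in process memory and fall back to the expensive storage path only on a miss. The following is my own minimal illustration, not the paper's implementation; the loader callback and the TTL-based best-effort freshness are assumptions.

```javascript
// Minimal application-level cache sketch: fully materialized objects,
// no RPC, serialization, or query path on a hit. Not the paper's design.
class LinkedCache {
  constructor(loader, ttlMs) {
    this.loader = loader;    // expensive path: DB queries + materialization
    this.ttlMs = ttlMs;      // freshness is best-effort, as the paper notes
    this.entries = new Map();
  }
  async get(key) {
    const hit = this.entries.get(key);
    if (hit && Date.now() < hit.expiresAt) return hit.value; // pure CPU, no IO
    const value = await this.loader(key);                    // query amplification lives here
    this.entries.set(key, { value, expiresAt: Date.now() + this.ttlMs });
    return value;
  }
}
```

A hit costs only a Map lookup and a clock read; a miss pays the full storage path once per TTL window. That is the CPU-for-DRAM trade the paper measures, and the TTL makes the consistency caveat concrete: within a window, readers can observe stale objects.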

So that is the gist of the paper. The paper makes two adjacent points. Special cases of this observation, if you will. And let's cover them for completeness.

The first point is rich-object workloads, which is where the most striking evaluation results come from. For services where a single logical read expands into many database queries (e.g., metadata services and control planes), caching fully materialized objects at the application level avoids query amplification entirely. And this yields up to an order-of-magnitude cost reduction versus uncached reads and roughly 2x improvement over caching denormalized key-value representations.

The second result, a negative result, is also important. Adding even lightweight freshness or version checks largely erases these gains, because the check itself traverses most of the database stack. The experiments make clear that strong consistency remains fundamentally at odds with the cost benefits of application-level caching. The paper leaves this as an open challenge, saying that we still lack a clean, low-cost way to combine strong consistency with the economic benefits of application-level caching. I think it is possible to employ leases to trade off an increase in update latency with cost efficiency, and alleviate this problem. Or we could just say: Cache coherence is hard, let's go shopping for CXL!


Discussion

Overall, the paper quantifies something many practitioners intuit but rarely measure. If you care about cost, including monetary cost, move caching up the stack, cache rich objects, and trade memory for CPU burn.

As usual Aleksey and I did a live-reading of the paper. And as usual we had a lot to argue and gripe about. Above is a recording of our discussion, and this links to my annotated paper.

Of course, Aleksey zeroed in on the metastability implications right from the abstract. And yes, the metastability implications remained unaddressed in the paper. If you cut costs and operate at lower CPU provisioning (thanks to this cache assist), you make yourself prone to failure by operating at maximum utilization, without any slack. That means that the moment the cache fails or becomes inaccessible, your application will be hit with 2-3x more traffic than it can handle and suffer unavailability or metastability.

I had some reservations about application-level caches. They are undeniably effective, but they lack the reusability and black-box nature of storage-layer caching. Storage-side caching is largely free, transparent, and naturally shared across nodes and applications. Application-level caching, by contrast, requires careful design and nontrivial development effort. It also sacrifices reuse and sharing, since each application must manage its own cache semantics and lifecycle. I wish the paper had discussed these costs and tradeoffs.

The writing, after the introduction section, was repetitive and subpar. Sections 2 and 3 largely repeated the introduction and wasted space. Then we got only 2 paragraphs of the Theoretical Analysis section, which we had actually looked forward to reading. That section is effectively cropped out of the paper, even though it forms the core of the paper's argument.

The paper's subtitle (see the headers on pages 3, 5, and 7) is a copy-paste error from the authors' HotNets 2024 paper. There did not seem to be any camera-ready checks on the paper. To motivate strong consistency, the paper drops several citations in Section 2.3, calling them recent work; only 2 of the 6 are from after 2014. The figures were sloppy as well. Did you notice Figure 6 above? The y-axes do not cover the same ranges, which makes it very hard to compare the subfigures. The y-axis in Figure 5 uses relative costs, which is also not of much use. It may be that in 2025 most people use LLMs to read papers, but one should still write papers as if humans will read them, past the introduction section, line by line, to understand and check the work.


Finally, here is an interesting question to ponder on. Does this paper conflict with storage disaggregation trend?

At first glance, the paper appears to push against the storage disaggregation trend by arguing for tighter coupling between computation and cached data to meet real-time freshness constraints. In reality, it does not reject disaggregation, but warns that disaggregated designs require additional caching above the storage layer: storage-side caching alone does not suffice, from either a latency or a cost perspective. The paper also points to a hidden cost: freshness guarantees degrade when cache coherence is treated as a best-effort side effect of an eventually consistent pipeline. The paper's message is that disaggregation needs explicit freshness semantics and coordination mechanisms. So maybe a corollary is that we should expect disaggregated systems to inevitably grow "stateful edges" over time in order to recover performance and control.

Update Request! New PostgreSQL RPMs Released to Disable Debug Assertions

We recently identified that a batch of our Percona Server for PostgreSQL RPM packages were inadvertently compiled with the debug assertion flag (–enable-cassert) enabled. While these assertions are invaluable for our developers during the testing phase, they are not intended for production use. We have since rebuilt the packages and strongly recommend that all users […]

Percona Operator for MongoDB in 2025: Making Distributed MongoDB More Predictable on Kubernetes

In 2025, the Percona Operator for MongoDB focused on the hardest parts of running MongoDB in Kubernetes: reliable backups and restores, clearer behavior during elections and restores, better observability at scale, and safer defaults as MongoDB 8.0 became mainstream. The year included real course corrections, such as addressing PBM connection leaks and being explicit about […]