A curated list of database news from authoritative sources

November 10, 2025

The Future of Fact-Checking is Lies, I Guess

Last weekend I was trying to pull together sources for an essay and kept finding “fact check” pages from factually.co. For instance, a Kagi search for “pepper ball Chicago pastor” returned this Factually article as the second result:

Fact check: Did ice agents shoot a pastor with pepperballs in October in Chicago

The claim that “ICE agents shot a pastor with pepperballs in October” is not supported by the available materials supplied for review; none of the provided sources document a pastor being struck by pepperballs in October, and the only closely related reported incident involves a CBS Chicago reporter’s vehicle being hit by a pepper ball in late September [1][2]. Available reports instead describe ICE operations, clergy protests, and an internal denial of excessive force, but they do not corroborate the specific October pastor shooting allegation [3][4].

Here’s another “fact check”:

Fact check: Who was the pastor shot with a pepper ball by ICE

No credible reporting in the provided materials identifies a pastor who was shot with a pepper‑ball by ICE; multiple recent accounts instead document journalists, protesters, and community members being hit by pepper‑ball munitions at ICE facilities and demonstrations. The available sources (dated September–November 2025) describe incidents in Chicago, Los Angeles and Portland, note active investigations and protests, and show no direct evidence that a pastor was targeted or injured by ICE with a pepper ball [1] [2] [3] [4].

These certainly look authoritative. They’re written in complete English sentences, with professional diction and lots of nods to neutrality and skepticism. There are lengthy, point-by-point explanations with extensively cited sources. The second article goes so far as to suggest “who might be promoting a pastor-victim narrative”.

The problem is that both articles are false. This story was broadly reported, as in this October 8th Fox News article unambiguously titled “Video shows federal agent shoot Chicago pastor in head with pepper ball during Broadview ICE protest”. DHS Assistant Secretary Tricia McLaughlin even went on X to post about it. This event definitely happened, and it would not have been hard to find coverage at the time these articles were published. It was, quite literally, all over the news.

Or maybe they’re sort of true. Each summary disclaims that its findings are based on “the available materials supplied for review”, or “the provided materials”. This is splitting hairs. Source selection is an essential part of the fact-checking process, and Factually selects its own sources in response to user questions. Instead of finding authoritative sources, Factually selected irrelevant ones and spun them into a narrative which is the opposite of true. Many readers will not catch this distinction. Indeed, I second-guessed myself when I saw the Factually articles—and I read the original reporting when it happened.

“These conversations matter for democracy,” says the call-to-action at the top of every Factually article. The donation button urges readers to “support independent reporting.”

But this is not reporting. Reporters go places and talk to people. They take photographs and videos. They search through databases, file FOIA requests, read court transcripts, evaluate sources, and integrate all this with an understanding of social and historical context. People go to journalism school to do this.

What Factually does is different. It takes a question typed by a user and hands it to a Large Language Model, or LLM, to generate some query strings. It performs up to three Internet search queries, then feeds the top nine web pages it found to an LLM, and asks a pair of LLMs to spit out some text shaped like a fact check. This text may resemble the truth, or—as in these cases—utterly misrepresent it.

Calling Factually’s articles “fact checks” is a category error. A fact checker diligently investigates a contentious claim, reasons about it, and ascertains some form of ground truth. Fact checkers are held to a higher evidentiary standard; they are what you rely on when you want to be sure of something. The web pages on factually.co are fact-check-shaped slurry, extruded by a statistical model which does not understand what it is doing. They are fancy Mad Libs.

Sometimes the Mad Libs are right. Sometimes they’re blatantly wrong. Sometimes it is clear that the model simply has no idea what it is doing, as in this article where Factually is asked whether it “creates fake fact-checking articles”, and in response turns to web sites like Scam Adviser, which evaluate site quality based on things like domain age and the presence of an SSL certificate, or Scam Detector, which looks for malware and phishing. Neither of these sources has anything to do with content accuracy. When asked if Factually is often incorrect (people seem to ask this a lot), Factually’s LLM process selects sources like DHS Debunks Fake News Media Narratives from June, Buzzfeed’s 18 Science Facts You Believed in the 1990s That Are Now Totally Wrong, and vocabulary.com’s definition of the word “Wrong”. When asked about database safety, Factually confidently asserts that “Advocates who state that MongoDB is serializable typically refer to the database’s support for snapshot isolation,” omitting that Snapshot Isolation is a completely different, weaker property. Here’s a Factually article on imaginary “med beds” which cites this incoherent article claiming to have achieved quantum entanglement via a photograph. If a real fact checker shows you a paper like this with any degree of credulousness, you can safely ignore them.

The end result of this absurd process is high-ranking, authoritative-sounding web pages which sometimes tell the truth, and sometimes propagate lies. Factually has constructed a stochastic disinformation machine which exacerbates the very problems fact-checkers are supposed to solve.

Please stop doing this.

Comparing Integers and Doubles

During automated testing we stumbled upon a problem that boiled down to transitive comparisons: If a=b, and a=c, then we assumed that b=c. Unfortunately that is not always the case, at least not in all systems. Consider the following SQL query:

select a=b, a=c, b=c
from (values(
   1234567890123456789.0::double precision,
   1234567890123456788::bigint,
   1234567890123456789::bigint)) s(a,b,c)

If you execute that in Postgres (or DuckDB, or SQL Server, or ...) the answer is (true, true, false). That is, the comparison is not transitive! Why does that happen? When these systems compare a bigint and a double, they promote the bigint to double and then compare. But a double has only 52 bits of mantissa, which means it will lose precision when promoting large integers to double, producing false positives in the comparison.
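
The same effect is easy to reproduce outside SQL. Here is a minimal standalone C++ illustration (my own, not from the original post): the implicit promotion of both bigints to double makes all three values compare equal.

#include <cstdint>
#include <iostream>

int main() {
   double a = 1234567890123456789.0;
   int64_t b = 1234567890123456788;
   int64_t c = 1234567890123456789;

   // b and c are implicitly promoted to double for the mixed comparisons,
   // and both round to the same 53-bit significand, so a==b and a==c hold
   // even though b!=c.
   std::cout << (a == b) << " " << (a == c) << " " << (b == c) << "\n";
   // prints: 1 1 0
}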

This behavior is highly undesirable, first because it confuses the optimizer, and second because (at least in our system) joins work very differently: Hash joins promote to the most restrictive type and discard all values that cannot be represented, as they can never find a join partner. For double/bigint joins that leads to observable differences between joins and plain comparisons, which is very bad.

How should we compare correctly? Conceptually the situation is clear: an IEEE 754 floating-point number with sign s, mantissa m, and exponent e represents the value (-1)^s * m * 2^e, and we just have to compare the integer with that value. But there is no easy way to do that: if we do an int/double comparison in, e.g., C++, the compiler performs the same promotion to double, messing up the comparison.

We can get the logic right by doing two conversions: We first convert the int to double and compare that. If the values are not equal, the order is clear and we can use that. Otherwise, we convert the double back to an integer and check if the conversion rounded up or down, and handle the result. Plus some extra checks to avoid undefined behavior (the conversion of intmax64->double->int64 is not defined) and to handle non-finite values, and we get: 

#include <cstdint>

// Three-way comparison of a double and an int64_t without losing precision:
// returns -1 if a < b, 0 if a == b, and 1 if a > b (NaN is treated as larger).
int cmpDoubleInt64(double a, int64_t b) {
   // handle intmax and nan: the constant is the largest double below 2^63, so
   // any a failing this test is NaN or exceeds every int64_t; this also keeps
   // the int64_t cast below well defined
   if (!(a<=0x1.fffffffffffffp+62)) return 1;

   // fast path comparison: promote b to double; if the values differ even
   // after rounding, the order is already decided
   double bd = b;
   if (a!=bd) return (a<bd)?-1:1;

   // handle loss of precision: a and b are equal as doubles, so convert a
   // back to an integer (now safe) and compare exactly
   int64_t ai = a;
   if (ai!=b) return (ai<b)?-1:1;
   return 0;
}

Which is the logic that we now use. Who else does it correctly? Perhaps somewhat surprisingly, Python and SQLite. Other database systems (and programming languages) that we tested all lost precision during the comparison, leading to tons of problems. IMHO a proper int/double comparison should be available in every programming language, at least as a library function. But in most languages (and DBMSes) it isn't. You can use the code above if you ever have this problem.
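
For instance, with the values from the SQL query above, here is a quick sketch of what the function returns (my own illustration, not from the original post); note that the double literal is actually stored as 1234567890123456768.0:

   cmpDoubleInt64(1234567890123456789.0, 1234567890123456788);  // -1: the double is smaller
   cmpDoubleInt64(1234567890123456789.0, 1234567890123456789);  // -1: the double is smaller
   cmpDoubleInt64(1234567890123456768.0, 1234567890123456768);  //  0: exactly equal

Unlike the promoting comparison, these results are mutually consistent and transitive.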

Taurus MM: A Cloud-Native Shared-Storage Multi-Master Database

This VLDB'23 paper presents Taurus MM, Huawei's cloud-native, multi-master OLTP database built to scale write throughput in clusters of 2 to 16 masters. It extends the single-master TaurusDB design (which we reviewed yesterday) into a multi-master system while following its shared-storage architecture with separate compute and storage layers. Each master maintains its own write-ahead log (WAL) and executes transactions independently; there are no distributed transactions. All masters share the same Log Stores and Page Stores, and data is coordinated through new algorithms that reduce network traffic and preserve strong consistency.

The system uses pessimistic concurrency control to avoid frequent aborts on contended workloads. Consistency is maintained through two complementary mechanisms: a new clock design that makes causal ordering efficient, and a new hybrid locking protocol that cuts coordination cost.


Vector-Scalar (VS) Clocks

A core contribution is the Vector-Scalar (VS) clock, a new type of logical clock that combines the compactness of Lamport clocks with the causal precision/completeness of vector clocks.

Ordinary Lamport clocks are small, but they capture causality in one direction only: if event a happened before event b, then C(a) < C(b), but the converse does not hold. Vector clocks capture causality fully, but scale poorly. An 8-node vector clock adds 64 bytes to every message or log record, which turns into a prohibitive cost when millions of short lock and log messages per second are exchanged in a cluster. Taurus MM solves this by letting the local component of each node's VS clock behave like a Lamport clock, while keeping the rest of the vector to track other masters' progress. This hybrid makes the local counter advance faster (it reflects causally related global events, not just local ones) yet still yields vector-like ordering when needed.

VS clocks can stamp messages either with a scalar or a vector timestamp depending on context. Scalar timestamps are used when causality is already known, such as for operations serialized by locks or updates to the same page. Vector timestamps are used when causality is uncertain, such as across log flush buffers or when creating global snapshots.
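
To make the idea concrete, here is a rough C++ sketch of such a clock as I understand it from the paper's description (the names and exact update rules are mine, for illustration; this is not the paper's code):

#include <algorithm>
#include <cstdint>
#include <vector>

struct VSClock {
   std::vector<uint64_t> v;   // v[i] tracks master i's progress
   size_t self;               // index of the local master

   VSClock(size_t masters, size_t me) : v(masters, 0), self(me) {}

   // Local event: the local component advances like a Lamport counter.
   uint64_t tick() { return ++v[self]; }

   // Scalar stamp: sufficient when causality is already known,
   // e.g., operations serialized by the same page lock.
   uint64_t scalar() const { return v[self]; }

   // Vector stamp: used when causality is uncertain,
   // e.g., across log flush buffers or for global snapshots.
   const std::vector<uint64_t>& full() const { return v; }

   // Receiving a scalar timestamp: Lamport-style merge into the local entry.
   void receiveScalar(uint64_t ts) { v[self] = std::max(v[self], ts) + 1; }

   // Receiving a vector timestamp: element-wise max, then advance the local
   // entry past everything seen, so it reflects causally related global events.
   void receiveVector(const std::vector<uint64_t>& ts) {
      uint64_t seen = v[self];
      for (size_t i = 0; i < v.size(); i++) {
         v[i] = std::max(v[i], ts[i]);
         seen = std::max(seen, ts[i]);
      }
      v[self] = seen + 1;
   }
};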

I really like the VS clocks algorithm, and how it keeps most timestamps compact while still preserving ordering semantics. It's conceptually related to Hybrid Logical Clocks (HLC) in that it keeps per-node clock values close and comparable, but VS clocks are purely logical, driven by Lamport-style counters instead of synchronized physical time. The approach enables rapid creation of globally consistent snapshots and reduces timestamp size and bandwidth consumption by up to 60%.

I enjoyed the paper's pedagogical style in Section 5, as it walks the reader through deciding whether each operation needs scalar or vector timestamps. This  makes it clear how we can enhance efficiency by applying the right level of causality tracking to each operation.


Hybrid Page-Row Locking

The second key contribution is a hybrid page-row locking protocol. Taurus MM maintains a Global Lock Manager (GLM) that manages page-level locks (S and X) across all masters. Each master also runs a Local Lock Manager (LLM) that handles row-level locks independently once it holds the covering page lock.

The GLM grants page locks, returning both the latest page version number and any row-lock info. Once a master holds a page lock, it can grant compatible row locks locally without contacting the GLM. When the master releases a page, it sends back the updated row-lock state so other masters can reconstruct the current state lazily.

Finally, row-lock changes don't need to be propagated immediately and are piggybacked on the page lock release flow. This helps reduce lock traffic dramatically. The GLM only intervenes when another master requests a conflicting page lock.

This separation of global page locks and local row locks resembles our 2014 Panopticon work, where we combined global visibility and local autonomy to limit coordination overhead.


Physical and Logical Consistency

Taurus MM distinguishes between physical and logical consistency. Physical consistency ensures structural correctness of pages. The master groups log records into log flush buffers (LFBs) so that each group ends at a physically consistent point (e.g., a B-tree split updates parent and children atomically within LFB bounds). Read replicas apply logs only up to group boundaries, avoiding partial structural states without distributed locks.

Logical consistency ensures isolation-level correctness for user transactions (Repeatable Read isolation). Row locks are held until commit, while readers can use consistent snapshots without blocking writers.


Ordering and Replication

Each master periodically advertises the location of its latest log records to all others in a lightweight, peer-to-peer fashion. This mechanism is new in Taurus MM. In single-master TaurusDB, the metadata service (via Metadata PLogs) tracked which log segments were active, but not the current write offsets within them (the master itself notified read replicas of the latest log positions). In Taurus MM, with multiple masters generating logs concurrently, each master broadcasts its current log positions to the others, avoiding a centralized metadata bottleneck.

To preserve global order, each master groups its recent log records (updates from multiple transactions and pages) into a log flush buffer (LFB) before sending it to the Log and Page Stores. Because each LFB may contain updates to many pages, different LFBs may touch unrelated pages. It becomes unclear which buffer depends on which, so the system uses vector timestamps to capture causal relationships between LFBs produced on different masters. Each master stamps an LFB with its current vector clock and also includes the timestamp of the previous LFB, allowing receivers to detect gaps or missing buffers. When an LFB reaches a Page Store, though, this global ordering is no longer needed. The Page Store processes each page independently, and all updates to a page are already serialized by that page's lock and carry their own scalar timestamps (LSNs). The Page Store simply replays each page's log records in increasing LSN order, ignoring the vector timestamp on the LFB. In short, vector timestamps ensure causal ordering across masters before the LFB reaches storage, and scalar timestamps ensure correct within-page ordering after.
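
A small sketch of what an LFB header might carry under this scheme (the field names are mine, for illustration only):

#include <cstdint>
#include <vector>

struct LogRecord {
   uint64_t pageId;
   uint64_t lsn;        // scalar timestamp: orders updates within a page
   // ... redo payload ...
};

struct LogFlushBuffer {
   uint32_t masterId;
   std::vector<uint64_t> vsTimestamp;       // vector timestamp of this LFB
   std::vector<uint64_t> prevVsTimestamp;   // timestamp of the previous LFB,
                                            // letting receivers detect gaps
   std::vector<LogRecord> records;          // updates from many transactions and pages
};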

For strict transaction consistency, a background thread exchanges full vector (VS) timestamps among masters to ensure that every transaction sees all updates committed before it began. A master waits until its local clock surpasses this merged/pairwise-maxed timestamp before serving the read in order to guarantee a globally up-to-date view. If VS were driven by physical rather than purely logical clocks, these wait times could shrink further.


Evaluation and Takeaways

Experiments on up to eight masters show good scaling on partitioned workloads and performance advantages over both Aurora Multi-Master (shared-storage, optimistic CC) and CockroachDB (shared-nothing, distributed commit).

The paper compares Taurus MM with CockroachDB using TPC-C–like OLTP workloads. CockroachDB follows a shared-nothing design, with each node managing its own storage and coordinating writes through per-key Raft consensus. Since Taurus MM uses four dedicated nodes for its shared storage layer, while CockroachDB combines compute and storage on the same nodes, the authors matched configurations by comparing 2 and 8 Taurus masters with 6- and 12-node CockroachDB clusters, respectively. For CockroachDB, they used its built-in TPC-C–like benchmark; for Taurus MM, the Percona TPC-C variant with zero think/keying time. Results for 1000 and 5000 warehouses show Taurus MM delivering 60% to 320% higher throughput and lower average and 95th-percentile latencies. The authors also report scaling efficiency, showing both systems scaling similarly on smaller datasets (1000 warehouses), but CockroachDB scaling slightly more efficiently on larger datasets with fewer conflicts. They attribute this to CockroachDB’s distributed-commit overhead, which dominates at smaller scales but diminishes once transactions touch only a subset of nodes, whereas Taurus MM maintains consistent performance by avoiding distributed commits altogether.

Taurus MM shows that multi-master can work in the cloud if coordination is carefully scoped. The VS clock is a general and reusable idea, as it provides a middle ground between Lamport and vector clocks. I think VS clocks are useful for other distributed systems that need lightweight causal ordering across different tasks/components.

But is the additional complexity worth it for the workloads? Few workloads may truly demand concurrent writes across primaries. Amazon Aurora famously abandoned its own multi-master mode. Still, from a systems-design perspective, Taurus MM contributes a nice architectural lesson.

November 09, 2025

Taurus Database: How to be Fast, Available, and Frugal in the Cloud

This SIGMOD’20 paper presents TaurusDB, Huawei's disaggregated MySQL-based cloud database. TaurusDB refines the disaggregated architecture pioneered by Aurora and Socrates, and provides a simpler and cleaner separation of compute and storage. 

In my writeup on Aurora, I discussed how the "log is the database" approach reduces network load, since the compute primary only sends logs and the storage nodes apply them to reconstruct pages. But Aurora did conflate durability and availability somewhat and used quorum-based replication of six replicas for both logs and pages.

In my review of Socrates, I explained how Socrates (Azure SQL Cloud) separates durability and availability by splitting the system into four layers: compute, log, page, and storage. Durability (logs) ensures data is not lost after a crash. Availability (pages/storage) ensures data can still be served while some replicas or nodes fail. Socrates stores pages separately from logs to improve performance but the excessive layering introduces significant architectural overhead.

Taurus takes this further and uses different replication and consistency schemes for logs and pages, exploiting their distinct access patterns. Logs are append-only and used for durability. Log records are independent, so they can be written to any available Log Store nodes. As long as three healthy Log Stores exist, writes can proceed without quorum coordination. Pages, however, depend on previous versions. A Page Store must reconstruct the latest version by applying logs to old pages. To leverage this asymmetry, Taurus uses synchronous, reconfigurable replication for Log Stores to ensure durability, and asynchronous replication for Page Stores to improve scalability, latency, and availability.


But hey, why do we disaggregate in the first place?

Traditional databases were designed for local disks and dedicated servers. In the cloud, this model wastes resources as shown in Figure 1. Each MySQL replica keeps its own full copy of the data, while the underlying virtual disks already store three replicas for reliability. Three database instances mean nine copies in total, and every transactional update is executed three times. This setup is clearly redundant, costly, and inefficient.

Disaggregation fixes this and also brings true elasticity! Compute and storage are separated because they behave differently. Compute is expensive and variable; storage is cheaper and grows slowly. Compute can be stateless and scaled quickly, while storage must remain durable. Separating them allows faster scaling, shared I/O at storage, better resource use, and the capability of scaling compute to zero and restarting quickly when needed.


Architecture overview

Taurus has two physical layers, compute and storage, and four logical components: Log Stores, Page Stores, the Storage Abstraction Layer (SAL), and the database front end. Keeping only two layers minimizes cross-network hops.

The database front end (a modified MySQL) handles queries, transactions, and log generation. The master handles writes; read replicas serve reads.

Log Store stores (well duh!) write-ahead logs as fixed-size, append-only objects called PLogs. These are synchronously replicated across three nodes. Taurus favors reconfiguration-based replication: If one replica fails or lags, a new PLog is created. Metadata PLogs track active PLogs.

Page Store materializes/manages 10 GB slices of page data. Each page version is identified by an LSN, and the Page Store can reconstruct any version. Pages are written append-only, which is 2–5x faster than random writes and gentler on flash. Each slice maintains a lock-free Log Directory mapping (page, version) to log offset. Consolidation of logs into pages happens in memory. Taurus originally prioritized by longest chain first, but then reverted to oldest unapplied write first to prevent metadata buildup. A local buffer pool accelerates log application. For cache eviction, Taurus finds that LFU (least frequently used) performs better than LRU (least recently used), because it keeps these hot pages in cache longer, reducing I/O and improving consolidation throughput. 

Storage Abstraction Layer (SAL) hides the storage complexity from MySQL by serving as an intermediary. It coordinates between Log Stores and Page Stores, manages slice placement, and tracks the Cluster Visible LSN, the latest globally consistent point. SAL advances CV-LSN only when the logs are durable in Log Stores and at least one Page Store has acknowledged them. SAL also batches writes per slice to reduce small I/Os.


Write path and replication

Did you notice the lack of LogStore to PageStore communication in Figure 2 and Figure 3? The paper does not address this directly, but yes, there is no direct LogStore-to-PageStore communication. The SAL in the master mediates this instead. SAL first writes logs to the Log Stores for durability. Once acknowledged, SAL forwards the same logs to the relevant Page Stores. This ensures that Page Stores only see durable logs and lets SAL track exactly what each replica has received. SAL monitors per-slice persistent LSNs for Page Stores, and resends missing logs from the Log Stores if it detects regressions.
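
Here is a rough sketch of the ordering SAL enforces on the write path, as I read it (the interfaces and names below are hypothetical, for illustration; batching, retries, and slice placement are omitted):

#include <algorithm>
#include <cstdint>
#include <map>
#include <vector>

struct LogRecord { uint64_t sliceId; uint64_t lsn; /* redo payload */ };

struct LogStoreClient  { void appendDurably(const std::vector<LogRecord>&) { /* synchronous 3-way replication */ } };
struct PageStoreClient { uint64_t apply(const std::vector<LogRecord>& recs) { return recs.back().lsn; } };

struct SAL {
   LogStoreClient logStore;
   std::map<uint64_t, PageStoreClient> pageStores;   // one per slice (simplified)
   std::map<uint64_t, uint64_t> persistentLsn;       // per-slice LSN acked by a Page Store
   uint64_t cvLsn = 0;                               // Cluster Visible LSN

   void write(const std::vector<LogRecord>& recs) {
      if (recs.empty()) return;
      // 1. Durability first: logs go to the Log Stores and must be acknowledged.
      logStore.appendDurably(recs);
      // 2. Only durable logs are forwarded to the Page Stores, per slice.
      for (const LogRecord& r : recs) {
         uint64_t acked = pageStores[r.sliceId].apply({r});
         persistentLsn[r.sliceId] = std::max(persistentLsn[r.sliceId], acked);
      }
      // 3. The CV-LSN advances only after the logs are durable and at least
      //    one Page Store has acknowledged them.
      cvLsn = std::max(cvLsn, recs.back().lsn);
   }
};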

I think this choice adds coupling and complexity. A chain-replication design, where LogStores streamed logs directly to PageStores, would simplify the system. This way, SAL wouldn't need to track every PageStore's persistent LSN. And log truncation could be driven by LogStores once all replicas confirmed receipt, instead of being tracked by SAL again.


Read path

Database front ends read data at page granularity. A dirty page in the buffer pool cannot be evicted until its logs have been written to at least one Page Store replica. This ensures that the latest version is always recoverable.

As mentioned above, SAL maintains the last LSN sent per slice. Reads are routed to the lowest-latency Page Store replica. If one is unavailable or behind, SAL retries with others.

Read replicas don't stream WAL directly from the master. Instead, the master publishes which PLog holds new updates. Replicas fetch logs from the Log Stores, apply them locally, and track their visible LSN. They don't advance past the Page Stores' persisted LSNs, keeping reads consistent. This design keeps replica lag below 20 ms even under high load and prevents the master from becoming a bandwidth bottleneck.


Recovery model

If a Log Store fails temporarily, writes to its PLogs pause. For long failures, the cluster re-replicates its data to healthy nodes.

Page Store recovery is more involved. After short outages, a Page Store gossips with peers to catch up. For longer failures, the system creates a new replica by copying another's data. If recent logs were lost before replication, SAL detects gaps in persistent LSNs and replays the missing records from Log Stores. Gossip runs periodically but can be triggered early when lag is detected.

If the primary fails, SAL ensures all Page Stores have every log record persisted in Log Stores. This is the redo phase (similar to ARIES). Then the database front end performs undo for in-flight transactions.


Nitpicks

I can't refrain from bringing up a couple of issues.

First, RDMA appears in Figure 2 as part of the storage network but then disappears entirely until a brief mention in the final "future work" paragraph.

Second, the evaluation section feels underdeveloped. It lacks the depth expected from a system of this ambition. I skipped detailed discussion of this section in my review, as it adds little insight beyond what is discussed in the protocols. 

November 08, 2025

I Want You to Understand Chicago

I want you to understand what it is to live in Chicago.

Every day my phone buzzes. It is a neighborhood group: four people were kidnapped at the corner drugstore. A friend a mile away sends a Slack message: she was at the scene when masked men assaulted and abducted two people on the street. A plumber working on my pipes is upset, and I find out that two of his employees were kidnapped that morning. A week later it happens again.

An email arrives. Agents with guns have chased a teacher into the school where she works. They did not have a warrant. They dragged her away, ignoring her and her colleagues’ pleas to show proof of her documentation. That evening I stand a few feet from the parents of Rayito de Sol and listen to them describe, with anguish, how good Ms. Diana was to their children. What it is like to have strangers with guns traumatize your kids. For a teacher to hide a three-year-old child for fear they might be killed. How their relatives will no longer leave the house. I hear the pain and fury in their voices, and I wonder who will be next.

Understand what it is to pray in Chicago. On September 19th, Reverend David Black, lead pastor at First Presbyterian Church of Chicago, was praying outside the ICE detention center in Broadview when a DHS agent shot him in the head with pepper balls. Pepper balls are never supposed to be fired at the head, because, as the manufacturer warns, they could seriously injure or even kill. “We could hear them laughing as they were shooting us from the roof,” Black recalled. He is not the only member of the clergy ICE has assaulted. Methodist pastor Hannah Kardon was violently arrested on October 17th, and Baptist pastor Michael Woolf was shot by pepper balls on November 1st.

Understand what it is to sleep in Chicago. On the night of September 30th, federal agents rappelled from a Black Hawk helicopter to execute a raid on an apartment building on the South Shore. Roughly three hundred agents deployed flashbangs, busted down doors, and took people indiscriminately. US citizens, including women and children, were grabbed from their beds, marched outside without even a chance to dress, zip-tied, and loaded into vans. Residents returned to find their windows and doors broken, and their belongings stolen.

Understand what it is to lead Chicago. On October 3rd, Alderperson Jesse Fuentes asked federal agents to produce a judicial warrant and allow an injured man at the hospital access to an attorney. The plainclothes agents grabbed Fuentes, handcuffed her, and took her outside the building. Her lawsuit is ongoing. On October 21st, Representative Hoan Huynh was going door-to-door to inform businesses of their immigration rights when he was attacked by six armed CBP agents, who boxed in his vehicle and pointed a gun at his face. Huynh says the agents tried to bash open his car window.

Understand what it is to live in Chicago. On October 9th, Judge Ellis issued a temporary restraining order requiring that federal agents refrain from deploying tear gas or shooting civilians without an imminent threat, and requiring two audible warnings. ICE and CBP have flouted these court orders. On October 12th, federal agents shoved to the ground an attorney who tried to help a man being detained in Albany Park. Agents refused to identify themselves or produce a warrant, then deployed tear gas without warning. On October 14th, agents rammed a car on the East Side, then tear-gassed neighbors and police.

On October 23rd, federal agents detained seven people, including two U.S. citizens and an asylum seeker, in Little Village. Two worked for Alderperson Michael Rodriguez: his chief of staff Elianne Bahena, and police district council member Jacqueline Lopez. Again in Little Village, agents tear-gassed and pepper-sprayed protestors, seizing two high school students and a security guard, among others. Alderperson Byron Sigcho-Lopez reported that agents assaulted one of the students, who had blood on his face. On October 24th, agents in Lakeview emerged from unmarked cars, climbed a locked fence to enter a private yard, and kidnapped a construction worker. As neighbors gathered, they deployed four tear-gas canisters. That same day, a few blocks away, men with rifles jumped out of SUVs and assaulted a man standing at a bus stop.

“They were beating him,” said neighbor Hannah Safter. “His face was bleeding”.

They returned minutes later and attacked again. A man from the Laugh Factory, a local comedy club, had come outside with his mother and sister. “His mom put her body in between them, and one of the agents kicked her in the face”.

Understand what it is to raise a family in Chicago. The next day, October 25th, federal agents tear-gassed children in Old Irving Park. Again, no warnings were heard. On October 26th, agents arrested a 70-year-old man and threw a 67-year-old woman to the ground in Old Irving Park, then tear-gassed neighbors in Avondale. That same day, federal agents deployed tear gas at a children’s Halloween parade in Old Irving Park.

“Kids dressed in Halloween costumes walking to a parade do not pose an immediate threat to the safety of a law enforcement officer. They just don’t. And you can’t use riot control weapons against them,” Judge Ellis said to Border Patrol chief Gregory Bovino.

Understand how the government speaks about Chicago. On November 3rd, paralegal Dayanne Figueroa, a US citizen, was driving to work when federal agents crashed into her car, drew their guns, and dragged her from the vehicle. Her car was left behind, coffee still in the cup holder, keys still in the car. The Department of Homeland Security blamed her, claiming she “violently resisted arrest, injuring two officers.” You can watch the video for yourself.

“All uses of force have been more than exemplary,” Bovino stated in a recent deposition. He is, as Judge Ellis has stated, lying. Bovino personally threw a tear-gas canister in Little Village. He claimed in a sworn deposition that he was struck in the head by a rock before throwing the canister, and when videos showed no rock, admitted that he lied about the event. When shown video of himself tackling peaceful protestor Scott Blackburn, Bovino refused to acknowledge that he tackled the man. Instead, he claimed, “That’s not a reportable use of force. The use of force was against me.”

“I find the government’s evidence to be simply not credible,” said Judge Ellis in her November 6th ruling. “The use of force shocks the conscience.”

Understand what it is to be Chicago. To carry a whistle and have the ICIRR hotline in your phone. To wake up from nightmares about shouting militiamen pointing guns at your face. To rehearse every day how to calmly refuse entry, how to identify a judicial warrant, how to film and narrate an assault. To wake to helicopters buzzing your home, to feel your heart rate spike at the car horns your neighbors use to alert each other to ICE and CBP enforcement. To know that perhaps three thousand of your fellow Chicagoans have been disappeared by the government, but no one really knows for sure. To know that many of those seized were imprisoned a few miles away, as many as a hundred and fifty people in a cell, denied access to food, water, sanitation, and legal representation. To know that many of these agents—masked, without badge numbers or body cams, and refusing to identify themselves—will never face justice. To wonder what they tell their children.

The masked thugs who attack my neighbors, who point guns at elected officials and shoot pastors with pepper balls, who tear-gas neighborhoods, terrify children, and drag teachers and alderpeople away in handcuffs are not unprecedented. We knew this was coming a year ago, when Trump promised mass deportations. We knew it was coming, and seventy-seven million of us voted for it anyway.

This weight presses upon me every day. I am flooded with stories. There are so many I cannot remember them all; cannot keep straight who was gassed, beaten, abducted, or shot. I write to leave a record, to stare at the track of the tornado which tears through our city. I write to leave a warning. I write to call for help.

I want you to understand, regardless of your politics, the historical danger of a secret police. What happens when a militia is deployed in our neighborhoods and against our own people. Left unchecked, their mandate will grow; the boundaries of acceptable identity and speech will shrink. I want you to think about elections in this future. I want you to understand that every issue you care about—any hope of participatory democracy—is downstream of this.

I want you to understand what it is to love Chicago. To see your neighbors make the heartbreaking choice between showing up for work or staying safe. To march two miles long, calling out: “This is what Chicago sounds like!” To see your representatives put their bodies on the line and their voices in the fight. To form patrols to walk kids safely to school. To join rapid-response networks to document and alert your neighbors to immigration attacks. For mutual aid networks to deliver groceries and buy out street vendors so they can go home safe. To talk to neighbor after neighbor, friend after friend, and hear yes, yes, it’s all hands on deck.

I want you to understand Chicago.

November 07, 2025

How to Set Up Valkey, The Alternative to Redis

New to Valkey? This guide walks you through the basics and helps you get up and running. Starting with new tech can feel overwhelming, but if you’re ready to explore Valkey, you probably want answers, not some fancy sales pitch. Let’s cut to the chase: Switching tools or trying something new should never slow you […]

November 06, 2025

PostgreSQL 13 Is Reaching End of Life. The Time to Upgrade is Now!

PostgreSQL 13 will officially reach End-of-Life (EOL) on November 13, 2025. After this date, the PostgreSQL Global Development Group will stop releasing security patches and bug fixes for this version. That means if you’re still running PostgreSQL 13, you’ll soon be on your own with no updates, no community support, and growing security risks. Why […]

Query Compilation Isn't as Hard as You Think

Query compilation based on the produce/consume model has a reputation for delivering high query performance, but also for being more difficult to implement than interpretation-based engines using vectorized execution. While it is true that building a production-grade query compiler with low compilation latency requires substantial infrastructure and tooling effort, what is sometimes overlooked is that the vectorization paradigm is not easy to implement either, but for different reasons.

Vectorization relies on vector-at-a-time rather than tuple-at-a-time processing. High-level operations (e.g., inserting tuples into a relation) must be decomposed into per-attribute vector kernels that are then executed successively. This requires a specific way of thinking and can be quite challenging, depending on the operator. In practice, the consequence is that the range of algorithms that can be realistically implemented in a vectorized engine is limited.
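
For a flavor of what that decomposition looks like, consider a filtered sum. In a vectorized engine it is split into per-attribute kernels, roughly like the following sketch (my own illustration, not taken from any particular engine): a comparison kernel produces a selection vector, and a summation kernel consumes it.

#include <cstdint>
#include <vector>

// Kernel 1: compare one attribute column against a constant, emitting the
// matching positions as a selection vector (sel must have room for col.size()).
size_t selGreaterThan(const std::vector<int64_t>& col, int64_t c,
                      std::vector<uint32_t>& sel) {
   size_t n = 0;
   for (size_t i = 0; i < col.size(); i++)
      if (col[i] > c) sel[n++] = static_cast<uint32_t>(i);
   return n;
}

// Kernel 2: sum another attribute column at the selected positions.
int64_t sumSelected(const std::vector<int64_t>& col,
                    const std::vector<uint32_t>& sel, size_t n) {
   int64_t s = 0;
   for (size_t i = 0; i < n; i++) s += col[sel[i]];
   return s;
}

Every operator has to be expressed as a pipeline of such kernels working on column vectors, which is what makes some algorithms awkward to fit into the paradigm.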

In contrast, compilation is fundamentally simpler and more flexible in terms of the resulting code. It supports tuple-at-a-time processing, which makes it easier to implement complex algorithmic ideas. Developing a query processing algorithm typically involves three steps: First, implement the algorithm manually as a standalone program, i.e., outside the compiling query engine. Second, explore different algorithms, benchmark, and optimize this standalone implementation. Finally, once a good implementation has been found, modify the query compiler to generate code that closely resembles the manually developed version.

At TUM, we use a simple pedagogical query compilation framework called p2c ("plan to C") for teaching query processing. It implements a query compiler that, given a relational operator tree as input, generates the C++ code that computes the query result. The core of the compiler consists of about 600 lines of C++ code and supports the TableScan, Selection, Map, Sort, Aggregation, and Join operators.

The generated code is roughly as fast as single-threaded DuckDB and can be easily inspected and debugged, since it is simply straightforward C++ code. This makes it easy for students to optimize (e.g., by adding faster hash tables or multi-core parallelization) and to extend (e.g., with window functions).
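
To contrast with the vectorized kernels above, here is the kind of fused, tuple-at-a-time C++ loop a produce/consume compiler generates for a query like select sum(b) from R where a > 10 (a hypothetical example in the spirit of p2c, not its actual output):

#include <cstdint>
#include <vector>

struct R { std::vector<int64_t> a, b; };    // toy column-wise table

int64_t query(const R& r) {
   int64_t sum = 0;                          // Aggregation state
   for (size_t i = 0; i < r.a.size(); i++) { // TableScan produces tuples
      if (r.a[i] > 10)                       // Selection consumes each tuple
         sum += r.b[i];                      // Aggregation consumes survivors
   }
   return sum;
}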

To keep p2c simple, it does not support NULL values or variable-size data types (strings have a fixed maximum length). Compiling to template-heavy C++ results in very high compilation times, making it impractical for production use. At the same time, p2c employs the same core concepts as Hyper and Umbra, making it an excellent starting point for learning about query compilation. Note that this type of compilation to C++ can also serve as an effective prototyping platform for research that explores new query processing ideas.

Find the code for p2c on github: https://github.com/viktorleis/p2c

November 05, 2025

TLA+ Modeling of AWS outage DNS race condition

On Oct 19–20, 2025, AWS’s N. Virginia region suffered a major DynamoDB outage triggered by a DNS automation defect that broke endpoint resolution. The issue cascaded into a region-wide failure lasting nearly a full day and disrupted many companies’ services. As with most large-scale outages, the “DNS automation defect” was only the trigger; deeper systemic fragilities (see my post on the Metastable Failures in the Wild paper) amplified the impact. This post focuses narrowly on the race condition at the core of the bug, which is best understood through TLA+ modeling.

My TLA+ model builds on Waqas Younas’s Promela/Spin version. To get started quickly, I asked ChatGPT to translate his Promela model into TLA+, which turned out to be a helpful way to understand the system’s behavior, much more effective than reading the postmortem or prose descriptions of the race.

The translation wasn’t perfect, but fixing it wasn’t hard. The translated model treated the enactor’s logic as a single atomic action. In that case, no race could appear: the enactor always completed in uninterrupted fashion. Splitting the action into three steps (receive the DNS plan, apply it, and clean up old plans) exposed the race condition clearly. I then worked further to simplify the model to its essence. 

Below, I walk through the model step by step. This is meant as an exercise in practical TLA+ modeling. We need more such walkthroughs to demystify how model checking works in real debugging. You can even explore this one interactively, without installing anything, using the browser-based TLA+ tool Spectacle. Spectacle, developed by my brilliant colleague Will Schultz at MongoDB Research, provides an interactive playground for exploring and sharing TLA+ specifications in the browser. But patience: I'll need to explain the model first before we can do a walkthrough of the trace.

My model is available here. It starts by defining the constants and the variable names. There is one planner process and a set of enactor processes. I define ENACTORS as {e1, e2}, i.e., two processes, in the config file.

These variables are initialized in a pretty standard way. And the Planner process action is also straightforward.  It creates the next plan version if the maximum hasn’t been reached, increments latest_plan, and appends the new plan to plan_channel for enactors to pick up. All other state variables remain unchanged.

As I mentioned, the enactor process has three actions. It’s not a single atomic block but three subactions. The apply step is simple. EnactorReceiveStep(self) models an enactor taking a new plan from the shared queue. If the queue isn’t empty and the enactor is idle (pc=0), it assigns itself the plan at the head, advances its program counter to 1 (ready to apply), removes that plan from the queue, and leaves other state unchanged. 

The EnactorApplyStep models how an enactor applies a plan. If the plan it is processing isn't marked deleted, it makes that plan current, updates the highest plan applied, and advances its counter to 2. If the plan was deleted, it resets its state (processing and pc to 0).

The EnactorCleanupStep runs when an enactor finishes applying a plan (pc=2). It updates plan_deleted using Cleanup, by marking plans older than PLAN_AGE_THRESHOLD as deleted. It then resets that enactor’s state (processing and pc to 0), and leaves all other variables unchanged.

The Next action defines the system's state transitions: either the planner generates a new plan or an enactor executes one of its steps (receive, apply, cleanup).

NeverDeleteActive asserts that the currently active plan must never be marked deleted. This invariant breaks because the enactor’s operation isn’t executed as one atomic step but split into three subactions for performance reasons. Splitting the operation allows parallelism and avoids long blocking while waiting for slow parts—such as applying configuration changes—to complete. This design trades atomicity for throughput and responsiveness.

Anyone familiar with concurrency control or distributed systems can foresee how this race condition unfolds and leads to a NeverDeleteActive violation. The root cause is a classic time-of-check to time-of-update flaw. 

The trace goes as follows. The planner issues several consecutive plans. Enactor 1 picks up one but lags on the next two steps. Enactor 2 arrives and moves faster: it takes the next plan, applies it, and performs cleanup. Because the staleness threshold isn't yet reached, Enactor 1's plan survives this cleanup. On the next round, Enactor 2 gets another plan and applies it. Only then does Enactor 1 finally execute its Apply step, overwriting Enactor 2's newly installed plan. When Enactor 2's cleanup runs again, it deletes that plan (which is now considered stale), violating the invariant NeverDeleteActive.

You can explore this violation trace using a browser-based TLA+ trace explorer that Will Schultz built by following this link. The Spectacle tool loads the TLA+ spec from GitHub, interprets it using a JavaScript interpreter, and visualizes the step-by-step state changes. You can step backwards and forwards using the buttons, and explore enabled actions. This makes model outputs accessible to engineers unfamiliar with TLA+. You can share a violation trace simply by sending a link, as I did above.

Surprise with innodb_doublewrite_pages in MySQL 8.0.20+

In a recent post, The Quirks of Index Maintenance in Open Source Databases, I compared the IO load generated by open source databases while inserting rows in a table with many secondary indexes. Because of its change buffer, InnoDB was the most efficient solution. However, that’s not the end of the story. Evolution of the […]

November 04, 2025