What is a Metastable Failure?
Most outages follow a familiar script: something breaks, engineers fix it, the system recovers. The cause and the effect are temporally coupled — remove the trigger, and the problem disappears. But there is a more dangerous, rarer class of failure where this assumption completely breaks down.
A metastable failure is a persistent degraded state that outlives its original cause. A trigger — a traffic spike, a network hiccup, a cache flush — pushes the system past a threshold. What makes it metastable is what happens next: a sustaining effect kicks in, typically a feedback loop involving work amplification or reduced efficiency, that keeps the system pinned in the bad state even after the trigger is long gone.
An outage that involves a metastable failure is typically blamed, at first, on the trigger; the true root cause is the sustaining effect.
Paradoxically, as Bronson et al. document in their HotOS '21 paper, the root cause of these failures is often features that were built specifically to improve system efficiency or reliability — caching, retries, connection pooling, failover. The very mechanisms designed to make systems robust become the mechanisms that make failure self-perpetuating.
The Three-Phase Lifecycle
The lifecycle begins in a stable state. As load increases, the system crosses an implicit, invisible threshold into the vulnerable state — still serving traffic normally, but now fragile. Many production systems deliberately operate here permanently, because the vulnerable state is far more efficient than the stable state. Then a trigger kicks the system into the metastable failure state, where a feedback loop prevents any natural recovery.
Leaving the metastable state requires a strong corrective push: rebooting, dramatically reducing load, or changing a core policy. Simply removing the trigger does nothing, because the trigger is no longer driving the failure.
What is a Look-Aside Cache?
A look-aside cache (also called cache-aside or lazy-loading) is the dominant caching pattern in production systems. The application looks aside — toward a fast key-value store like Memcached or Redis — before hitting the primary database. The cache is a peer to the database, not a proxy in front of it.
The Read Path
On every read, the application checks the cache first. A cache hit returns the value immediately and skips the database. A cache miss falls through to the database — the result is then written into the cache for future requests. Writes typically invalidate the relevant cache entry. The application retains full control of caching logic; there is no intermediary to configure or operate.
The efficiency gains can be dramatic. A well-tuned look-aside cache can eliminate the vast majority of database queries. A database that would otherwise be crushed at 3,000 QPS operates comfortably at 300 QPS because the cache absorbs the other 90%. The system scales to 10× without changing the database at all.
But this is precisely what creates the vulnerability. The database is never sized for full traffic. It is sized for cache-miss traffic. That distinction is invisible during normal operation, and catastrophic when the cache fails.
How the Cache Creates a Metastable Trap
Consider the concrete example from the paper. Under normal operation, the numbers look healthy: clients send 3,000 QPS, the cache serves 90% of those reads, and only 300 QPS of misses reach the database.
The system is healthy, but there is a hidden danger. The database can only handle around 300 QPS on its own. Any load above that is already in the vulnerable state. The system is operating at 10× its hidden capacity, sustained entirely by the continued existence of the cache.
The paper introduces a crucial distinction here, one that exposes the real risk: advertised capacity, the 3,000 QPS the system serves with a warm cache, versus hidden capacity, the 300 QPS it can serve without one. Only the second number describes what the system survives when the cache disappears.
The Trigger: Cache is Flushed
Now the cache is lost. A deployment misfires. Memcached runs out of memory. A routine maintenance window goes sideways. The reason barely matters — the effect is immediate and total: the cache is empty, the hit rate drops to zero, and all 3,000 QPS of traffic falls directly onto a database provisioned for 300.
The low cache hit rate overloads the database and slows its responses, which in turn prevents the cache from refilling. Losing a cache with a 90% hit rate means the database suddenly sees every request instead of one in ten: a 10× query amplification.
The trigger — the cache flush — is already gone. The cache could be back online, empty, and it would not matter. The feedback loop has taken ownership of the failure. The system cannot self-repair because the repair mechanism (writing to the cache) is downstream of the broken component (the database). You cannot fill the cache until the database responds. The database cannot respond until the cache is filled. This is the trap.
If your database could handle 3,000 QPS, a cache loss would be a bad day but not a crisis — performance degrades, then recovers as the cache warms up. The metastable failure only occurs because of the capacity mismatch: the database is provisioned for 300 QPS, and the normal operating load is 3,000 QPS. The 10× gap is the vulnerability. Closing that gap — even partially — transforms a potential metastable collapse into a survivable degradation.
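The bistability described above can be made concrete with a toy discrete-time model. The 3,000/300 QPS figures come from the running example; the refill, churn, and timeout dynamics are deliberately crude illustrative assumptions, not the paper's model.

```python
REUSE_FRACTION = 0.1   # assumed share of fresh fills re-read before churning out
CHURN_RATE = 0.01      # assumed share of cache coverage aging out per tick
DB_CAPACITY = 300.0    # QPS the database can serve (from the running example)

def tick(hit_rate: float, offered_qps: float) -> float:
    """One time step: how the cache hit rate evolves under the given load."""
    misses = offered_qps * (1.0 - hit_rate)
    overload = misses / DB_CAPACITY
    # When the database is overloaded, responses slow and requests time out
    # before their results can be written back into the cache; past 2x
    # overload, assume no fill succeeds at all (a crude timeout model).
    fill_success = min(1.0, max(0.0, 2.0 - overload))
    fills = min(misses, DB_CAPACITY) * fill_success
    gain = fills / offered_qps * REUSE_FRACTION
    return min(0.9, hit_rate + gain - CHURN_RATE * hit_rate)

# Trigger: cache flushed (hit rate -> 0), then the trigger is "removed"
# (the cache is back online and writable). At full load, it never refills.
h = 0.0
for _ in range(1000):
    h = tick(h, offered_qps=3000.0)
print(f"after 1000 ticks at 3000 QPS: hit rate {h:.2f}")  # stays at 0.00

# Strong corrective push: shed load to 500 QPS and the cache warms again.
for _ in range(1000):
    h = tick(h, offered_qps=500.0)
print(f"after shedding to 500 QPS:    hit rate {h:.2f}")  # recovers to 0.90
```

At full load this model has two fixed points, a healthy one near a 90% hit rate and a pinned one at 0%, which is exactly the metastability the text describes: removing the trigger changes nothing, while shedding load moves the system back to the healthy basin.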
Detecting the Failure: Characteristic Metrics
The paper introduces a practical diagnostic concept: the characteristic metric — a signal that spikes when the trigger fires and stays elevated for the entire duration of the metastable failure, only returning to normal once recovery is complete. Monitoring a characteristic metric gives you a window into the state of the feedback loop itself, not just the surface symptoms.
For the look-aside cache failure, the natural characteristic metrics are database latency and request timeout rate. Both spike when the cache is lost. Neither recovers until the system escapes the metastable state. An alert on either — combined with a simultaneous collapse in cache hit rate — is a strong signal that you are not dealing with a transient spike.
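That alerting rule can be sketched as a simple predicate. The thresholds below are made-up placeholders, to be tuned against your own baselines; the structure (a spike in either characteristic metric, co-occurring with a hit-rate collapse) follows the text.

```python
# Hedged sketch: flag a suspected metastable failure when database latency
# or timeout rate spikes WHILE the cache hit rate collapses. All threshold
# values are illustrative assumptions, not recommendations.

def suspected_metastable(db_p99_ms: float, timeout_rate: float,
                         cache_hit_rate: float) -> bool:
    latency_spiked = db_p99_ms > 500.0      # characteristic metric #1
    timeouts_spiked = timeout_rate > 0.05   # characteristic metric #2
    cache_collapsed = cache_hit_rate < 0.5  # the co-occurring trigger signal
    return (latency_spiked or timeouts_spiked) and cache_collapsed

print(suspected_metastable(900.0, 0.12, 0.05))  # True: likely metastable
print(suspected_metastable(900.0, 0.12, 0.92))  # False: cache still healthy
```

The key property is the conjunction: a latency spike alone is a transient; a latency spike that persists alongside an empty cache is the feedback loop itself showing up in your dashboards.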
Across other production failure modes, one characteristic metric recurs: queueing delay. It is particularly valuable because it is more resilient to workload changes than raw QPS numbers. A small minimum queueing delay over a sliding window means the queue drained at some point recently; even a large queue is probably a manageable spike. A persistently large minimum signals structural overload, not transient noise. This is the CoDel insight applied to internal work queues.
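The sliding-window-minimum check can be sketched directly. Window size and threshold here are illustrative assumptions; the point is that the alert fires on the minimum over the window, not the mean or maximum.

```python
from collections import deque

class QueueDelayMonitor:
    """Track recent queueing delays; alert only on a persistently large minimum."""

    def __init__(self, window: int = 100, threshold_ms: float = 50.0):
        self.samples: deque = deque(maxlen=window)  # sliding window of delays
        self.threshold_ms = threshold_ms

    def record(self, delay_ms: float) -> None:
        self.samples.append(delay_ms)

    def structurally_overloaded(self) -> bool:
        # A small minimum means the queue fully drained at some point in the
        # window, so even a large spike is probably manageable. A large
        # minimum means the queue never drains: structural overload.
        return bool(self.samples) and min(self.samples) > self.threshold_ms

m = QueueDelayMonitor()
for d in [5, 200, 3, 450, 8]:        # spiky, but the queue drains (min = 3ms)
    m.record(d)
print(m.structurally_overloaded())   # False

for d in [120, 95, 210, 180] * 30:   # the queue never drains (min = 95ms)
    m.record(d)
print(m.structurally_overloaded())   # True
```

A production version would maintain the window minimum incrementally rather than rescanning with `min()`, but the control logic is the same.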
Breaking the Loop
The paper is emphatic on one point: treat the sustaining effect as the root cause, not the trigger. There are many potential triggers for any given metastable failure; addressing one of them does not prevent the next. Addressing the feedback loop itself does. The approaches that work all weaken the loop or widen the margin around it: shed or throttle load so the database can make forward progress and the cache can refill; bound retries and add backoff so retry amplification does not compound the miss storm; cap concurrency to the database so admitted misses actually complete instead of timing out; and provision headroom to narrow the gap between cache-on and cache-off capacity.
The paper flags a subtle incentive problem. An improved cache eviction algorithm reduces average database load, which makes it tempting to reclaim database resources. This looks like a win on every standard metric. But it widens the gap between advertised and hidden capacity — making the system more efficient in the common case and more catastrophic when the cache fails. Organizations that reward capacity reduction without measuring hidden capacity will keep making this mistake. Incentivizing reductions in cold cache misses, by contrast, yields a true capacity win — because it raises hidden capacity without reducing headroom.
The Deeper Lesson
Metastable failures are hard precisely because they violate the mental model most engineers carry: fix the cause, the effect goes away. The look-aside cache case is a clean illustration of why that model fails. The cause is a cache flush. The sustained outage is caused by a feedback loop the flush merely triggered — and the loop runs indefinitely on its own logic.
These failures behave as black swan events — seemingly impossible until they happen, trivially explainable in hindsight. None of the cases in the HotOS paper were identified ahead of time. Some recurred over months to years before being fully resolved. One involving link imbalance defied explanation for more than two years and was ultimately fixed by changing a single line: the connection pool's eviction policy.
The look-aside cache failure illustrates what the paper terms efficiency dependency: the system is not actually capable of handling the traffic it claims to serve. It is capable of handling what the cache allows through. Every optimization that increases the gap between these two numbers increases the blast radius when the cache fails. The cache is not just a performance feature. It is a load-bearing wall — and most teams never discover that until the wall comes down.
A systematic approach for building systems that are robust against unknown metastable failures remains an open problem.