What is a Metastable Failure?
Most outages follow a familiar script: something breaks, engineers fix it, the system recovers. The cause and the effect are temporally coupled — remove the trigger, and the problem disappears. But there is a more dangerous, rarer class of failure where this assumption completely breaks down.
A metastable failure is a persistent degraded state that outlives its original cause. A trigger — a traffic spike, a network hiccup, a cache flush — pushes the system past a threshold. What makes it metastable is what happens next: a sustaining effect kicks in, typically a feedback loop involving work amplification or reduced efficiency, that keeps the system pinned in the bad state even after the trigger is long gone.
An outage that involves a metastable failure is typically blamed, at first, on the trigger; the true root cause is the sustaining effect.
Paradoxically, as Bronson et al. document in their HotOS '21 paper, the root cause of these failures is often features that were built specifically to improve system efficiency or reliability — caching, retries, connection pooling, failover. The very mechanisms designed to make systems robust become the mechanisms that make failure self-perpetuating.
The Three-Phase Lifecycle
The lifecycle begins in a stable state. As load increases, the system crosses an implicit, invisible threshold into the vulnerable state — still serving traffic normally, but now fragile. Many production systems deliberately operate here permanently, because the vulnerable state is far more efficient than the stable state. Then a trigger kicks the system into the metastable failure state, where a feedback loop prevents any natural recovery.
Leaving the metastable state requires a strong corrective push: rebooting, dramatically reducing load, or changing a core policy. Simply removing the trigger does nothing, because the trigger is no longer driving the failure.
What is a Look-Aside Cache?
A look-aside cache (also called cache-aside or lazy-loading) is the dominant caching pattern in production systems. The application looks aside — toward a fast key-value store like Memcached or Redis — before hitting the primary database. The cache is a peer to the database, not a proxy in front of it.
The Read Path
On every read, the application checks the cache first. A cache hit returns the value immediately and skips the database. A cache miss falls through to the database — the result is then written into the cache for future requests. Writes typically invalidate the relevant cache entry. The application retains full control of caching logic; there is no intermediary to configure or operate.
The efficiency gains can be dramatic. A well-tuned look-aside cache can eliminate the vast majority of database queries. A database that would otherwise be crushed at 3,000 QPS operates comfortably at 300 QPS because the cache absorbs the other 90%. The system scales to 10× without changing the database at all.
But this is precisely what creates the vulnerability. The database is never sized for full traffic. It is sized for cache-miss traffic. That distinction is invisible during normal operation, and catastrophic when the cache fails.
How the Cache Creates a Metastable Trap
Consider the concrete example from the paper. Under normal operation, the numbers look healthy: clients send 3,000 QPS, the cache serves 90% of those reads, and only 300 QPS of misses reach the database.
The system is healthy, but there is a hidden danger. The database can only handle around 300 QPS on its own. Any load above that is already in the vulnerable state. The system is operating at 10× its hidden capacity, sustained entirely by the continued existence of the cache.
The paper introduces a crucial distinction here, one that exposes the real risk: advertised capacity, the 3,000 QPS the system serves with a warm cache, versus hidden capacity, the 300 QPS it can serve without one. Only the second number describes what the system survives when the cache disappears.
The Trigger: Cache is Flushed
Now the cache is lost. A deployment misfires. Memcached runs out of memory. A routine maintenance window goes sideways. The reason barely matters — the effect is immediate and total: the cache is empty, the hit rate drops to zero, and all 3,000 QPS of traffic falls directly onto a database provisioned for 300.
The low cache hit rate overloads the database and slows its responses, which in turn prevents the cache from refilling. Losing a cache with a 90% hit rate means the database suddenly sees every request instead of one in ten: a 10× query amplification.
The trigger — the cache flush — is already gone. The cache could be back online, empty, and it would not matter. The feedback loop has taken ownership of the failure. The system cannot self-repair because the repair mechanism (writing to the cache) is downstream of the broken component (the database). You cannot fill the cache until the database responds. The database cannot respond until the cache is filled. This is the trap.
If your database could handle 3,000 QPS, a cache loss would be a bad day but not a crisis — performance degrades, then recovers as the cache warms up. The metastable failure only occurs because of the capacity mismatch: the database is provisioned for 300 QPS, and the normal operating load is 3,000 QPS. The 10× gap is the vulnerability. Closing that gap — even partially — transforms a potential metastable collapse into a survivable degradation.
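The bistability described above can be made concrete with a toy discrete-time model. The 3,000/300 QPS figures come from the running example; the refill, churn, and timeout dynamics are deliberately crude illustrative assumptions, not the paper's model.

```python
REUSE_FRACTION = 0.1   # assumed share of fresh fills re-read before churning out
CHURN_RATE = 0.01      # assumed share of cache coverage aging out per tick
DB_CAPACITY = 300.0    # QPS the database can serve (from the running example)

def tick(hit_rate: float, offered_qps: float) -> float:
    """One time step: how the cache hit rate evolves under the given load."""
    misses = offered_qps * (1.0 - hit_rate)
    overload = misses / DB_CAPACITY
    # When the database is overloaded, responses slow and requests time out
    # before their results can be written back into the cache; past 2x
    # overload, assume no fill succeeds at all (a crude timeout model).
    fill_success = min(1.0, max(0.0, 2.0 - overload))
    fills = min(misses, DB_CAPACITY) * fill_success
    gain = fills / offered_qps * REUSE_FRACTION
    return min(0.9, hit_rate + gain - CHURN_RATE * hit_rate)

# Trigger: cache flushed (hit rate -> 0), then the trigger is "removed"
# (the cache is back online and writable). At full load, it never refills.
h = 0.0
for _ in range(1000):
    h = tick(h, offered_qps=3000.0)
print(f"after 1000 ticks at 3000 QPS: hit rate {h:.2f}")  # stays at 0.00

# Strong corrective push: shed load to 500 QPS and the cache warms again.
for _ in range(1000):
    h = tick(h, offered_qps=500.0)
print(f"after shedding to 500 QPS:    hit rate {h:.2f}")  # recovers to 0.90
```

At full load this model has two fixed points, a healthy one near a 90% hit rate and a pinned one at 0%, which is exactly the metastability the text describes: removing the trigger changes nothing, while shedding load moves the system back to the healthy basin.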
Detecting the Failure: Characteristic Metrics
The paper introduces a practical diagnostic concept: the characteristic metric — a signal that spikes when the trigger fires and stays elevated for the entire duration of the metastable failure, only returning to normal once recovery is complete. Monitoring a characteristic metric gives you a window into the state of the feedback loop itself, not just the surface symptoms.
For the look-aside cache failure, the natural characteristic metrics are database latency and request timeout rate. Both spike when the cache is lost. Neither recovers until the system escapes the metastable state. An alert on either — combined with a simultaneous collapse in cache hit rate — is a strong signal that you are not dealing with a transient spike.
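That alerting rule can be sketched as a simple predicate. The thresholds below are made-up placeholders, to be tuned against your own baselines; the structure (a spike in either characteristic metric, co-occurring with a hit-rate collapse) follows the text.

```python
# Hedged sketch: flag a suspected metastable failure when database latency
# or timeout rate spikes WHILE the cache hit rate collapses. All threshold
# values are illustrative assumptions, not recommendations.

def suspected_metastable(db_p99_ms: float, timeout_rate: float,
                         cache_hit_rate: float) -> bool:
    latency_spiked = db_p99_ms > 500.0      # characteristic metric #1
    timeouts_spiked = timeout_rate > 0.05   # characteristic metric #2
    cache_collapsed = cache_hit_rate < 0.5  # the co-occurring trigger signal
    return (latency_spiked or timeouts_spiked) and cache_collapsed

print(suspected_metastable(900.0, 0.12, 0.05))  # True: likely metastable
print(suspected_metastable(900.0, 0.12, 0.92))  # False: cache still healthy
```

The key property is the conjunction: a latency spike alone is a transient; a latency spike that persists alongside an empty cache is the feedback loop itself showing up in your dashboards.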
Across other production failure modes, one characteristic metric recurs: queueing delay. It is particularly valuable because it is more resilient to workload changes than raw QPS numbers. A small minimum queueing delay over a sliding window means the queue drained at some point recently; even a large queue is probably a manageable spike. A persistently large minimum signals structural overload, not transient noise. This is the CoDel insight applied to internal work queues.
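The sliding-window-minimum check can be sketched directly. Window size and threshold here are illustrative assumptions; the point is that the alert fires on the minimum over the window, not the mean or maximum.

```python
from collections import deque

class QueueDelayMonitor:
    """Track recent queueing delays; alert only on a persistently large minimum."""

    def __init__(self, window: int = 100, threshold_ms: float = 50.0):
        self.samples: deque = deque(maxlen=window)  # sliding window of delays
        self.threshold_ms = threshold_ms

    def record(self, delay_ms: float) -> None:
        self.samples.append(delay_ms)

    def structurally_overloaded(self) -> bool:
        # A small minimum means the queue fully drained at some point in the
        # window, so even a large spike is probably manageable. A large
        # minimum means the queue never drains: structural overload.
        return bool(self.samples) and min(self.samples) > self.threshold_ms

m = QueueDelayMonitor()
for d in [5, 200, 3, 450, 8]:        # spiky, but the queue drains (min = 3ms)
    m.record(d)
print(m.structurally_overloaded())   # False

for d in [120, 95, 210, 180] * 30:   # the queue never drains (min = 95ms)
    m.record(d)
print(m.structurally_overloaded())   # True
```

A production version would maintain the window minimum incrementally rather than rescanning with `min()`, but the control logic is the same.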
Breaking the Loop
The paper is emphatic on one point: treat the sustaining effect as the root cause, not the trigger. There are many potential triggers for any given metastable failure; addressing one of them does not prevent the next. Addressing the feedback loop itself does. The approaches that work all weaken the loop or widen the margin around it: shed or throttle load so the database can make forward progress and the cache can refill; bound retries and add backoff so retry amplification does not compound the miss storm; cap concurrency to the database so admitted misses actually complete instead of timing out; and provision headroom to narrow the gap between cache-on and cache-off capacity.
The paper flags a subtle incentive problem. An improved cache eviction algorithm reduces average database load, which makes it tempting to reclaim database resources. This looks like a win on every standard metric. But it widens the gap between advertised and hidden capacity — making the system more efficient in the common case and more catastrophic when the cache fails. Organizations that reward capacity reduction without measuring hidden capacity will keep making this mistake. Incentivizing reductions in cold cache misses, by contrast, yields a true capacity win — because it raises hidden capacity without reducing headroom.
The Deeper Lesson
Metastable failures are hard precisely because they violate the mental model most engineers carry: fix the cause, the effect goes away. The look-aside cache case is a clean illustration of why that model fails. The cause is a cache flush. The sustained outage is caused by a feedback loop the flush merely triggered — and the loop runs indefinitely on its own logic.
These failures behave as black swan events — seemingly impossible until they happen, trivially explainable in hindsight. None of the cases in the HotOS paper were identified ahead of time. Some recurred over months to years before being fully resolved. One involving link imbalance defied explanation for more than two years and was ultimately fixed by changing a single line: the connection pool's eviction policy.
The look-aside cache failure illustrates what the paper terms efficiency dependency: the system is not actually capable of handling the traffic it claims to serve. It is capable of handling what the cache allows through. Every optimization that increases the gap between these two numbers increases the blast radius when the cache fails. The cache is not just a performance feature. It is a load-bearing wall — and most teams never discover that until the wall comes down.
A systematic approach for building systems that are robust against unknown metastable failures remains an open problem.