§1 Motivation

In 2000, Eric Brewer conjectured — and Gilbert and Lynch later formally proved — that a distributed data store cannot simultaneously guarantee Consistency, Availability, and Partition Tolerance. This insight, known as the CAP Theorem, became foundational to how engineers reason about distributed system tradeoffs. It gave practitioners a precise vocabulary for architectural decisions that had previously been made by intuition alone.

As artificial intelligence transitions from research curiosity to production infrastructure, we find ourselves at a similar inflection point. Practitioners building AI systems are constantly negotiating tradeoffs they struggle to articulate cleanly. We propose that these tradeoffs, too, can be crystallized into a single theorem — one that will prove equally useful for the next generation of system designers.

Every sufficiently general theory of constraints eventually becomes the lens through which an entire generation of engineers sees their work.

— On the nature of architectural theorems

§2 The Three Properties

We define each property precisely before asserting their mutual incompatibility.

Speed (S)

A system exhibits Speed if it can produce a response within a latency budget appropriate for interactive or real-time use. In practice, this means sub-second time-to-first-token for most user-facing applications, or the ability to process high volumes of requests without queuing bottlenecks. Speed is a function of both model architecture (parameter count, layer depth, attention complexity) and inference infrastructure (hardware, batching, quantization).
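The latency budget in this definition can be made operational. A minimal sketch, assuming a model exposed as a plain token iterator (the generator interface and the one-second budget are illustrative, not any particular vendor's API):

```python
import time

def time_to_first_token(token_stream):
    """Return (first_token, seconds_elapsed) for any token iterator.

    Sub-second time-to-first-token is the working definition of Speed
    for interactive use; throughput-bound batch workloads would instead
    measure tokens/second over the full stream.
    """
    start = time.perf_counter()
    first = next(token_stream)
    return first, time.perf_counter() - start

def within_budget(ttft_seconds, budget_seconds=1.0):
    """Check a measured TTFT against an interactive latency budget."""
    return ttft_seconds <= budget_seconds
```

The same harness applies to any streaming backend, which is what makes Speed comparable across the archetypes discussed later.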

Generality (G)

A system exhibits Generality if it can competently handle a broad, unbounded distribution of tasks without requiring task-specific tuning. A general model performs adequately on novel prompts, domain-crossing queries, and compositional reasoning challenges it was not explicitly trained for. Generality is fundamentally about the breadth of the model's learned world representation.

Accuracy (A)

A system exhibits Accuracy if its outputs are reliably correct — factually grounded, logically valid, and consistent across semantically equivalent inputs. Accuracy is not merely about avoiding hallucination; it encompasses calibrated uncertainty, coherent multi-step reasoning, and reproducibility under equivalent conditions.

Informal statement: No deployed AI system can simultaneously maximize Generality, Accuracy, and Speed. Optimizing for any two necessarily compromises the third.

Let G, A, S ∈ [0, 1] denote normalized measures of Generality, Accuracy,
and Speed for a deployed AI system M.

Then:   (G, A, S) = (1, 1, 1) is not achievable.
For any system M, at least one of {G, A, S} is bounded strictly below its theoretical maximum,
given real constraints of compute, architecture, and optimization objectives.

The theorem does not preclude incremental improvements across all three axes over time — just as faster networks improved CAP system designs — but asserts that within any fixed resource envelope, the trilemma holds.

[Figure: a triangle whose vertices are Speed (low latency, high throughput), Accuracy (correctness, calibration), and Generality (broad task coverage); its edges are labeled Fast + General (small general-purpose models), Fast + Accurate (specialized fine-tuned models), and General + Accurate (frontier models, slow). Legend: "Pick two."]
Figure 1. The GAS Trilemma. Any viable AI system occupies an edge of this triangle — optimizing for two vertices while conceding the third.

§3 The Three Archetypes

The GAS Theorem implies exactly three stable design archetypes, corresponding to the three edges of the trilemma triangle. We examine each in turn.

Archetype I: Fast + General

Small, general-purpose models optimized for broad coverage at low latency. Capable across many domains but unreliable on tasks demanding precision.

Sacrifices: Accuracy — outputs are plausible but frequently wrong on non-trivial tasks.

Archetype II: Fast + Accurate

Fine-tuned specialist models: medical coders, SQL generators, legal classifiers. Reliable and quick within their domain — brittle everywhere else.

Sacrifices: Generality — catastrophically fails on out-of-distribution inputs.

Archetype III: General + Accurate

Frontier models: GPT-4 class, Claude Opus, Gemini Ultra. Broad competence and high reliability — but inference is expensive, slow, and resource-intensive.

Sacrifices: Speed — unsuitable for real-time or high-volume applications without caching layers.

§4 Empirical Grounding

The three archetypes are not theoretical abstractions. They map directly to product categories observable in the AI ecosystem as of early 2026.

Archetype I is exemplified by quantized small models (7B–13B parameter range, 4-bit GGUF) running locally or at the edge. They respond in milliseconds and handle a remarkable variety of conversational tasks, yet fail measurably on multi-step reasoning benchmarks, arithmetic, and tasks requiring precise factual recall. The original GPT-3 (175B at launch, but architecturally optimized for throughput) showed similar characteristics.

Archetype II characterizes the entire fine-tuning industry: domain-specific models trained on medical literature, legal corpora, financial filings, or code repositories. These systems often outperform frontier models within their domain on latency-sensitive benchmarks, but are deliberately brittle by design — their specialization is their value, and their inability to generalize is the acceptable cost.

Archetype III describes every frontier API product that powers today's most capable AI applications. These systems routinely pass professional-level examinations, reason across modalities, and handle novel compositional tasks. But their p95 time-to-first-token is measured in seconds, not milliseconds, making them inappropriate for real-time interactive voice, low-latency gaming AI, or high-frequency trading signal generation.

§5 Relationship to CAP

The structural analogy to Brewer's CAP Theorem is deliberate and instructive. We compare the two frameworks directly:

Dimension          | CAP Theorem                                    | GAS Theorem
-------------------|------------------------------------------------|--------------------------------------
Domain             | Distributed data stores                        | AI inference systems
Three properties   | Consistency, Availability, Partition Tolerance | Generality, Accuracy, Speed
Forcing event      | Network partition                              | Fixed compute / resource envelope
Viable archetypes  | CP systems, AP systems (CA is impractical)     | SG, SA, GA systems
Practical escape   | Tunable consistency (PACELC)                   | Mixture-of-experts, cascading routers
Implication        | Choose your failure mode deliberately          | Choose your constraint deliberately

One important distinction: CAP's partition tolerance is binary — a network partition either occurs or it doesn't. The GAS properties exist on a continuum. This makes the GAS Theorem somewhat softer than CAP in formal terms, but no less useful as a design lens. Engineers rarely need to reason at the formal boundary; they need to reason about the direction of their tradeoffs.

§6 Apparent Escapes and Their Limits

Critics will immediately point to techniques that appear to evade the trilemma. We address the most common.

Mixture-of-Experts (MoE)

MoE architectures route each token to a subset of specialized sub-networks, achieving apparent generality without activating all parameters for every query. This improves the Speed–Generality frontier meaningfully. However, MoE systems still sacrifice Accuracy on tasks that require deep, sustained computation across many reasoning steps — exactly the tasks where sparse activation is least helpful.
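The routing mechanism can be sketched as top-k gating over expert sub-networks. Everything below — the gate's dot-product scoring, the expert count, and the expert functions themselves — is an illustrative placeholder, not any production MoE implementation:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def moe_forward(token_vec, experts, gate_weights, k=2):
    """Route one token to its top-k experts and mix their outputs.

    Only k of len(experts) sub-networks run per token: compute scales
    with k while capacity scales with the total expert count, which is
    why MoE improves the Speed-Generality frontier.
    """
    # Gate: one score per expert (here a simple dot product).
    scores = [sum(w * x for w, x in zip(ws, token_vec)) for ws in gate_weights]
    top = sorted(range(len(experts)), key=lambda i: scores[i], reverse=True)[:k]
    mix = softmax([scores[i] for i in top])
    # Sparse activation: only the selected experts actually compute.
    outputs = [experts[i](token_vec) for i in top]
    return [sum(m * o[d] for m, o in zip(mix, outputs))
            for d in range(len(token_vec))]
```

The sketch makes the trade visible: a token whose task needs sustained computation across many experts still only gets k of them.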

Speculative Decoding and Cascading

Routing simple queries to small, fast models and escalating complex ones to larger models — a technique increasingly used in production — approximates the Archetype III profile at lower average latency. But this requires accurate query complexity estimation, and hard queries still pay the full frontier latency cost. The GAS constraint is relaxed at the system level only by accepting that Accuracy suffers on queries misclassified as simple.
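One common routing scheme is a confidence-gated cascade; the threshold and the confidence scale below are illustrative assumptions, not a specific production system:

```python
def cascade(query, small_model, frontier_model, threshold=0.8):
    """Two-tier cascade: the small model answers first and reports a
    confidence score; low-confidence answers escalate to the frontier
    model.

    The GAS cost is visible in both branches: hard queries pay
    small-model latency *plus* frontier latency, and an overconfident
    small-model answer ships without review — the misclassification
    penalty described above.
    """
    answer, confidence = small_model(query)
    if confidence >= threshold:
        return answer, "small"
    return frontier_model(query), "frontier"
```

Average latency falls only to the extent that the confidence estimate is itself accurate — which is the trilemma reappearing one level up.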

Hardware Improvements

As with CAP, hardware progress shifts the feasible frontier outward. Faster chips, better memory bandwidth, and improved quantization all move the operating point toward the center of the triangle. But they do not eliminate the triangle. They merely make it larger. The relative tradeoffs between archetypes persist.

§7 Implications for Practitioners

The GAS Theorem, if accepted, has immediate practical consequences for anyone building AI-native products.

Name your sacrifice before you start. Just as CAP forced database engineers to explicitly choose their consistency model, GAS forces AI architects to explicitly choose which property they will underinvest in. A customer support chatbot that must respond in 200ms is an SG system; acknowledge that its accuracy will disappoint on hard edge cases and design the human escalation path accordingly.

Match the archetype to the use case. Medical diagnosis assistance is a GA system — accuracy and generality are non-negotiable, latency can be measured in seconds. High-frequency trading signal generation is an SA system — speed and accuracy within a narrow domain, generality is irrelevant. Code completion in an IDE is an SG system — milliseconds matter, and occasional wrong suggestions are tolerable.
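The mapping above can be written down as an explicit selection rule. The requirement fields and the one-second cutoff are illustrative assumptions, not a calibrated methodology:

```python
from dataclasses import dataclass

@dataclass
class Requirements:
    max_latency_ms: float       # hard interactive budget
    needs_broad_coverage: bool  # open-ended input distribution?
    error_tolerance: str        # "low" or "high"

def choose_archetype(req):
    """Pick the GAS edge implied by a use case's hard constraints."""
    fast = req.max_latency_ms < 1000
    if fast and req.needs_broad_coverage:
        return "SG: small general-purpose model (concede Accuracy)"
    if fast and req.error_tolerance == "low":
        return "SA: fine-tuned specialist (concede Generality)"
    return "GA: frontier model (concede Speed)"

# The three use cases from the text:
ide_completion = Requirements(100, True, "high")    # SG
trading_signals = Requirements(50, False, "low")    # SA
medical_assist = Requirements(5000, True, "low")    # GA
```

Writing the rule down forces the "name your sacrifice" conversation to happen in code review rather than in production.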

Treat the trilemma as a communication tool. Product managers, engineers, and executives often argue past each other because they are implicitly optimizing for different vertices. Making the GAS constraint explicit surfaces these disagreements early, when they are cheapest to resolve.

§8 Conclusion

We have proposed the GAS Theorem: that no AI system can simultaneously maximize Generality, Accuracy, and Speed, and that any practical deployment represents a deliberate choice among three viable archetypes. We have grounded the theorem in the contemporary model landscape, drawn a careful analogy to CAP, addressed apparent counterexamples, and outlined implications for practitioners.

The value of such a theorem is not in its mathematical formality but in the clarity it provides. Brewer's CAP Theorem did not prevent engineers from building distributed systems — it gave them a shared vocabulary for talking about the systems they were already building. We hope the GAS Theorem does the same for the generation of engineers now building the AI-native products that will define the coming decade.

The goal is not to escape the triangle. The goal is to know which side you're standing on — and to stand there with intention.

References

[1] Brewer, E. A. (2000). Towards Robust Distributed Systems. PODC Keynote.

[2] Gilbert, S., & Lynch, N. (2002). Brewer's Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services. ACM SIGACT News, 33(2), 51–59.

[3] Abdin, M., et al. (2024). Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone. arXiv:2404.14219.

[4] Jiang, A. Q., et al. (2024). Mixtral of Experts. arXiv:2401.04088.

[5] Leviathan, Y., Kalman, M., & Matias, Y. (2023). Fast Inference from Transformers via Speculative Decoding. ICML 2023.