The Stand-Up Comedian Theory of Intelligence
TL;DR
- The world is high-dimensional, but intelligence finds the sparse paths through it.
- Recent architecture work exploits sparsity at multiple levels, each improving efficiency and scaling.
- AI systems are learning to walk those paths more efficiently. Progress is real, yet “comedian-level” understanding remains out of reach.
The Limits Of Your Language Mean The Limits Of Your World
The world is a high-dimensional space. An enormous number of possibilities exist at every moment: molecules could vibrate differently, sentences could be constructed in countless ways, proteins could fold into any of millions of configurations.
They don’t.
Each of these systems moves in its own characteristic way. An underlying structure, shaped over billions of years, gives us the specific reality we observe. The universe, for all its apparent complexity, is sparse. It occupies a thin manifold winding through the infinite space of what could be.
Demis Hassabis has been articulating this view recently, and it reframes how we should think about intelligence and the architectures we build to approximate it. [1]
The Stand-Up Comedian as Compression Engine
Why do people love stand-up comedians?
A great comedian collects observations from the chaos of everyday life: relationships, politics, the absurdity of human behavior. Some get deeply philosophical. But the art isn’t in the collection. It’s in the compression. They project the sprawling, high-dimensional mess of reality into a few catchy jokes that a wide audience can digest and laugh about.
The punchline lands because it captures something true about the underlying structure. Something the audience recognizes but couldn’t articulate themselves. The comedian has found the sparse representation.
This is what intelligence does. It compresses. It projects. It finds the low-dimensional manifold hidden in high-dimensional experience.
Language Models as Stand-Up Comedians
The Transformer, that attention-based neural network running on classical Turing machines, is a stand-up comedian in this framing.
It emerged as a solution to machine translation. Then it gained traction because it learned the underlying structure of language well. Given enough examples, it discovered the sparse patterns that govern how humans communicate.
Self-attention with residual connections takes this projection further. The residual stream creates pathways between layers, letting the network progressively refine its compression of high-dimensional training data into increasingly useful representations.
But this is most likely not understanding.
Think about protein folding. A protein doesn’t “know” what structure it’s folding into. No blueprint. No intention. It follows the physical constraints of its environment, and those constraints, shaped by billions of years of evolution, guide it toward functional configurations. The protein folds because nature told it so.
A language model works similarly. It has learned to compress and project, to be a good comedian, but the jokes land for reasons the model cannot introspect.
Self-Attention as a Sparsity Problem
Self-attention is fundamentally a sparsity problem.
When we attend to a sequence, we’re not treating every token as equally relevant to every other token. We’re modeling the sparse structure of natural language: in a 100,000-token document, only a tiny fraction of token pairs have meaningful relationships.
The O(n²) complexity of attention isn’t a bug to eliminate. It’s the cost of searching for needles in a haystack. Recent theoretical work proves this rigorously: asymptotically accurate sub-quadratic attention is mathematically impossible for arbitrary inputs. [2]
The proof treats attention as an Entropic Optimal Transport problem. Queries are sources distributing attention mass. Keys are targets receiving that mass. Softmax emerges naturally as the solution.
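As a sketch of why (notation mine, simplified to a single marginal constraint rather than the full EOT problem in [2]): if each query must spread exactly one unit of attention mass over the keys, and that allocation is regularized by entropy, the optimum is exactly the softmax row.

$$
A_i \;=\; \arg\max_{p \in \Delta^{n-1}} \left\langle p,\; \frac{q_i K^\top}{\sqrt{d}} \right\rangle + H(p)
\;=\; \operatorname{softmax}\!\left(\frac{q_i K^\top}{\sqrt{d}}\right),
\qquad H(p) = -\sum_j p_j \log p_j.
$$

The full transport problem additionally constrains the key-side marginals; this one-sided version is only meant to show where softmax comes from.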
Any sub-quadratic method must leave Ω(n²) query-key pairs uninspected. An adversary can always hide critical information in those uninspected pairs. Perfect compression requires exhaustive search.
But real-world language isn’t adversarial. It lies on a low-dimensional manifold. The comedian doesn’t need to process every possible joke, just the ones that resonate with the structure of human experience.
Practical sparse attention works not because we’ve solved the theoretical problem, but because we’re exploiting the sparsity that nature already provides.
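As a toy illustration (mine, not a method from any of the papers below), a sliding-window mask is the simplest way to exploit that locality: only token pairs within a fixed radius are ever scored.

```python
import torch

def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    """Causal mask that is True only for token pairs at most `window`
    positions apart; scores outside the band are never kept."""
    idx = torch.arange(seq_len)
    dist = idx[None, :] - idx[:, None]   # j - i for every (i, j) pair
    return (dist <= 0) & (dist.abs() <= window)

mask = sliding_window_mask(seq_len=8, window=2)
# e.g. scores.masked_fill(~mask, float("-inf")) before the softmax
```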
Hyper-Connections: Learning the Residual Manifold [3]
If self-attention models the sparsity of language, residual connections model the sparsity of computation itself.
The standard residual connection (x + F(x)) assumes a fixed relationship between layers: the identity mapping plus whatever the layer computes. Why should this relationship be fixed?
Hyper-Connections (Zhu et al., ByteDance, ICLR 2025) make it learnable.
The Seesaw Problem
Traditional residual variants create an uncomfortable trade-off:
| Variant | Strength | Weakness |
|---|---|---|
| Pre-Norm | Prevents gradient vanishing | Causes representation collapse; deep layers become redundant |
| Post-Norm | Prevents collapse | Reintroduces gradient vanishing |
The network must choose between stable training and expressive depth.
Learnable Connectivity
Hyper-Connections expand the residual stream into n parallel paths and introduce learnable matrices:
Standard: x_{l+1} = x_l + F(x_l)
Hyper: H_{l+1} = H_res · H_l + H_post · F(H_pre · H_l)
Pre-Norm and Post-Norm turn out to be non-trainable special cases of this general framework. Making the matrices learnable lets the network discover its own optimal connectivity.
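A minimal sketch of that formula in PyTorch (the shapes, initialization, and einsum layout are my assumptions, not the paper's exact parameterization):

```python
import torch
import torch.nn as nn

class HyperConnection(nn.Module):
    """Toy hyper-connection wrapper around a layer F, with n parallel
    residual streams mixed by small learnable matrices (illustrative only)."""
    def __init__(self, layer: nn.Module, n: int = 4):
        super().__init__()
        self.layer = layer
        self.H_res = nn.Parameter(torch.eye(n))                  # stream-to-stream residual mixing
        self.H_pre = nn.Parameter(torch.full((1, n), 1.0 / n))   # read the layer input from the streams
        self.H_post = nn.Parameter(torch.ones(n, 1))             # write the layer output back to the streams

    def forward(self, H: torch.Tensor) -> torch.Tensor:
        # H: (batch, n, seq, d_model) -- the widened residual stream
        x = torch.einsum("on,bnsd->bosd", self.H_pre, H).squeeze(1)   # H_pre · H
        y = self.layer(x)                                             # F(H_pre · H)
        return (torch.einsum("mn,bnsd->bmsd", self.H_res, H)          # H_res · H
                + torch.einsum("no,bsd->bnsd", self.H_post, y))       # + H_post · F(H_pre · H)
```

With n = 1 and all three matrices fixed to 1, this collapses back to the standard x + F(x) residual, which is the sense in which Pre-Norm and Post-Norm are special cases.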
What Emerges: The Λ-Pattern
Visualization reveals that trained networks learn a characteristic Λ-shaped pattern:
- Strong connections to recent layers (like Post-Norm)
- Strong connections to early layers (like Pre-Norm)
- Weaker connections to middle layers
The network discovers a sparse connectivity structure combining the benefits of both approaches. It finds the minimal set of connections that capture essential information flow.
Engineering Reality
Overhead is negligible: +0.03% parameters, +0.2% FLOPs for a 1B model. On a 7B MoE model, Hyper-Connections achieve 1.8× faster convergence and +6 points on ARC-Challenge.
Manifold-Constrained Hyper-Connections: Stability Through Geometry
DeepSeek took Hyper-Connections further with mHC [4], addressing a flaw that emerges at scale.
Instability at 27B
At 27B parameters, unconstrained Hyper-Connections become catastrophically unstable. The residual mapping matrix H_res, applied across L layers, compounds:
Signal after L layers ∝ (H_res)^L
Without constraints, eigenvalues drift away from 1.0. Signals explode when eigenvalues exceed 1 and vanish when they fall below it. At 27B scale, gradient norms spike to 3000× normal values.
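A toy demonstration of the compounding (mine, not DeepSeek's measurement): perturb the identity slightly and apply it across 64 layers.

```python
import torch

torch.manual_seed(0)
L, n = 64, 4
H_res = torch.eye(n) + 0.05 * torch.randn(n, n)  # eigenvalues near, but not at, 1
signal = torch.ones(n)
for _ in range(L):
    signal = H_res @ signal
print(signal.norm())  # typically far from the initial norm of 2: exploded or vanished
```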
The Birkhoff Polytope
mHC constrains H_res to be a doubly stochastic matrix: all entries non-negative, all rows and columns summing to 1. This places the matrix on the Birkhoff polytope, a geometric manifold with useful properties.
Multiplying by a doubly stochastic matrix cannot amplify signal magnitude. Products of doubly stochastic matrices remain doubly stochastic. And the space is rich enough for meaningful mixing patterns.
The projection uses the Sinkhorn-Knopp algorithm:
```python
import torch

def sinkhorn_projection(M, iterations=20):
    # Approximately project onto the Birkhoff polytope: exponentiate to get
    # positive entries, then alternate row/column normalization (Sinkhorn-Knopp).
    M = torch.exp(M)
    for _ in range(iterations):
        M = M / M.sum(dim=1, keepdim=True)  # Row normalization
        M = M / M.sum(dim=0, keepdim=True)  # Column normalization
    return M
```
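A quick sanity check of the projection, and of the “cannot amplify” claim (continuing the sketch above):

```python
M = torch.randn(4, 4)
P = sinkhorn_projection(M)
print(P.sum(dim=0), P.sum(dim=1))  # both ≈ [1, 1, 1, 1]

signal = torch.ones(4)
for _ in range(64):
    signal = P @ signal
print(signal)  # stays ≈ ones: a doubly stochastic map preserves total mass
```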
Engineering for Scale
DeepSeek implemented three optimizations:
Kernel Fusion. Multiple Sinkhorn iterations fused into single CUDA kernels.
Activation Recomputation. Intermediate values recomputed during backward pass, saving 40%+ memory.
DualPipe Communication Overlap. Network I/O hidden behind computation in pipeline-parallel training.
Result: stable training at 27B with only 6.7% time overhead. The Amax Gain Magnitude drops from ~3000 (unstable HC) to ~1.6 (stable mHC).
Engram: Bottom-Up Static Memory
Hyper-Connections improve how information flows through the network. Engram [5] addresses a different inefficiency: the network wastes dynamic computation on static knowledge.
The Problem
When predicting “Alexander the Great was king of ___”, the model doesn’t need complex reasoning. It needs to retrieve a fact. But Transformers use expensive attention and FFN operations to simulate what should be a simple lookup.
Conditional Memory
Engram introduces conditional memory as a new sparsity axis, complementary to MoE’s conditional computation:
| Sparsity Type | Mechanism | Cost |
|---|---|---|
| Conditional Computation (MoE) | Route to different experts | O(k) per token |
| Conditional Memory (Engram) | Hash N-grams into embedding tables | O(1) per token |
The approach is bottom-up: local N-gram patterns deterministically index into massive static embedding tables. This captures the sparse, compositional structure of language at the lexical level.
Architecture
```
Input: "Alexander the Great was king of"
    ↓
N-gram extraction: ["Alexander", "the Great", "was king", ...]
    ↓
Multi-head hashing: hash_1(ngram), hash_2(ngram), ...
    ↓
Table lookup: [e_1, e_2, e_3, ...] ← O(1) retrieval
    ↓
Context-aware gating: gate(hidden_state) * retrieved_embedding
    ↓
Add to residual stream
```
Context-aware gating matters here: the current hidden state modulates how much retrieved memory to use, suppressing irrelevant lookups.
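A hedged sketch of that lookup path (the hash mixing, table size, and sigmoid gate below are illustrative assumptions, not the paper's exact design):

```python
import torch
import torch.nn as nn

class EngramLookup(nn.Module):
    """Toy conditional-memory module: hash each position's n-gram into a few
    embedding tables, sum the hits, and gate them by the hidden state."""
    def __init__(self, d_model: int, table_size: int = 1_000_000, heads: int = 4):
        super().__init__()
        self.tables = nn.ModuleList(
            nn.Embedding(table_size, d_model) for _ in range(heads)
        )
        self.gate = nn.Linear(d_model, 1)
        self.table_size = table_size

    def forward(self, ngram_ids: torch.Tensor, hidden: torch.Tensor) -> torch.Tensor:
        # ngram_ids: (batch, seq) integer id of the local n-gram at each position
        # hidden:    (batch, seq, d_model) current residual-stream state
        retrieved = torch.zeros_like(hidden)
        for h, table in enumerate(self.tables):
            # Cheap stand-in for independent hash functions: remix the id per head.
            idx = (ngram_ids * (2654435761 + h)) % self.table_size
            retrieved = retrieved + table(idx)       # O(1) lookup per token
        g = torch.sigmoid(self.gate(hidden))         # context-aware gate
        return hidden + g * retrieved                # add to the residual stream
```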
The Sparsity Allocation Problem
There’s an optimal split between MoE and Engram capacity. Experiments reveal a U-shaped curve: pure MoE and pure Engram both underperform a hybrid with ~80% MoE / ~20% Engram allocation.
Effective Deepening
Engram doesn’t just store knowledge. It effectively deepens the network. By handling static pattern reconstruction in early layers, it frees computational depth for dynamic reasoning. Layer 8 of an Engram model is functionally equivalent to layer 12 of an MoE baseline.
Prefetching from Host Memory
Engram lookups are deterministic (based only on input text), so the system can prefetch from CPU RAM while the GPU computes previous layers:
```
GPU (Layer L-1): [======== Compute ========]
CPU → GPU:            [--- Prefetch for L ---]
                       ↑ overlapped, zero stall
```
Embedding tables with 100B+ parameters can sit in CPU RAM at a <3% throughput penalty. This bypasses GPU memory limits entirely.
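A minimal sketch of the overlap in PyTorch (the side-stream handling is simplified, and the names are mine, not DeepSeek's implementation):

```python
import torch

copy_stream = torch.cuda.Stream()

def prefetch_rows(cpu_table: torch.Tensor, idx: torch.Tensor) -> torch.Tensor:
    """Kick off an async host-to-device copy of the rows layer L will need,
    on a side stream, while the default stream still computes layer L-1."""
    with torch.cuda.stream(copy_stream):
        rows = cpu_table[idx].pin_memory()            # gather in CPU RAM
        return rows.to("cuda", non_blocking=True)     # async H2D copy

# ...default stream computes layer L-1 here...
# Before layer L consumes the rows:
# torch.cuda.current_stream().wait_stream(copy_stream)
```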
Results
| Model | MMLU | BBH | HumanEval | MATH |
|---|---|---|---|---|
| MoE-27B | 67.2 | 51.3 | 45.1 | 28.4 |
| Engram-27B | 70.6 | 56.3 | 51.8 | 33.7 |
Same parameters, same FLOPs, substantially better compression.
What Does Scaling Really Mean?
When we talk about scaling models, what do we actually want?
The model should become better at understanding the sparsity of natural language. It should find tighter compressions, more precise projections onto the manifold of meaningful text.
But we still don’t have a model that’s a good stand-up comedian.
We have good coders. Good mathematicians. These are domains where correctness is easily verifiable, where the sparse structure is explicit in formal rules. The model can learn to project onto these manifolds because we can tell it exactly when it’s wrong.
Comedy? Creativity? The kind of compression that makes a room full of strangers laugh at the same moment? That requires capturing something about the human condition that we can’t formalize. The sparse structure exists, but we don’t know how to supervise it.
The Pre-Training vs. Mid-Training Debate
This leads to a fundamental strategic question.
If you believe that model-level architecture is the primary driver of capability, pre-training should dominate your training budget.
The logic: a model with better architecture should perform better on the same training data. Invest in better attention mechanisms (sparse, hierarchical, dynamic), better connectivity (Hyper-Connections, mHC), better memory (Engram, retrieval augmentation), better sparsity allocation (MoE + conditional memory hybrids).
The architectural innovations in this post support this view. They achieve better performance with the same compute by improving how the model compresses and projects.
If you believe architectural gains are diminishing and model capability is plateauing, mid-training is the path forward.
This means longer context windows for better global compression. More data, but with higher quality and domain specificity. Task-specific fine-tuning stages. Better data curation and mixture strategies.
The emphasis shifts from how the model compresses to what it compresses.
The Pragmatic View
Both matter. Architecture determines the ceiling of compression quality. Data and compute determine how close you get to that ceiling.
The papers reviewed here suggest the architectural ceiling is still rising, at least from DeepSeek's vantage point. The innovations share a common theme: exploiting sparsity at multiple levels.
| Level | Innovation | Sparsity Exploited |
|---|---|---|
| Attention | Sparse/Dynamic patterns | Token-pair relevance |
| Connectivity | Hyper-Connections, mHC | Layer interaction structure |
| Memory | Engram | Static vs. dynamic knowledge |
| Experts | MoE | Input-dependent computation |
Each finds a different dimension of the low-dimensional manifold that language occupies.
The goal remains the same: build a system that compresses reality as well as a great comedian does. We’re not there. We may be building systems that tell jokes without understanding why they’re funny.
The jokes are getting better, though.
References
- Hassabis, D. (2026). Transcript on intelligence, compression, and the manifold hypothesis, https://lexfridman.com/demis-hassabis-2-transcript
- Litman, E. (2025). Your Transformer is Secretly an EOT Solver, https://elonlit.com/scrivings/your-transformer-is-secretly-an-eot-solver/
- Zhu, D. et al. (2025). Hyper-Connections. ICLR 2025. ByteDance.
- Xie, Z. et al. (2026). mHC: Manifold-Constrained Hyper-Connections. arXiv:2512.24880. DeepSeek-AI.
- Cheng, X. et al. (2026). Conditional Memory via Scalable Lookup: A New Axis of Sparsity for Large Language Models. arXiv:2601.07372. DeepSeek-AI & Peking University.