The Stand-Up Comedian Theory of Intelligence

TL;DR
- The world is high-dimensional, but intelligence finds the sparse paths through it.
- Recent architecture work exploits sparsity at multiple levels, each improving efficiency and scaling.
- AI systems are learning to walk those paths more efficiently. Progress is real, yet “comedian‑level” understanding remains out of reach.

The Limits Of Your Language Mean The Limits Of Your World

The world is a high-dimensional space. An enormous number of possibilities exist at every moment: molecules could vibrate differently, sentences could be constructed in countless ways, proteins could fold into any of millions of configurations.

They don’t.

Every construct moves in its own characteristic way. An underlying structure, empirically shaped over billions of years, gives us the specific reality we observe. The universe, for all its apparent complexity, is sparse. It occupies a thin manifold winding through the infinite space of what could be.

Demis Hassabis has been articulating this view recently, and it reframes how we should think about intelligence and the architectures we build to approximate it. [1]

The Stand-Up Comedian as Compression Engine

Why do people love stand-up comedians?

A great comedian collects observations from the chaos of everyday life: relationships, politics, the absurdity of human behavior. Some get deeply philosophical. But the art isn’t in the collection. It’s in the compression. They project the sprawling, high-dimensional mess of reality into a few catchy jokes that a wide audience can digest and laugh about.

The punchline lands because it captures something true about the underlying structure. Something the audience recognizes but couldn’t articulate themselves. The comedian has found the sparse representation.

This is what intelligence does. It compresses. It projects. It finds the low-dimensional manifold hidden in high-dimensional experience.

Language Models as Stand-Up Comedians

The Transformer, that attention-based neural network running on classical Turing machines, is a stand-up comedian in this framing.

It emerged as a solution to machine translation. Then it gained traction because it learned the underlying structure of language well. Given enough examples, it discovered the sparse patterns that govern how humans communicate.

Self-attention with residual connections takes this projection further. The residual stream creates pathways between layers, letting the network progressively refine its compression of high-dimensional training data into increasingly useful representations.

But this is most likely not understanding.

Think about protein folding. A protein doesn’t “know” what structure it’s folding into. No blueprint. No intention. It follows the physical constraints of its environment, and those constraints, shaped by billions of years of evolution, guide it toward functional configurations. The protein folds because nature told it so.

A language model works similarly. It has learned to compress and project, to be a good comedian, but the jokes land for reasons the model cannot introspect.

Self-Attention as a Sparsity Problem

Self-attention is fundamentally a sparsity problem.

When we attend to a sequence, we’re not treating every token as equally relevant to every other token. We’re modeling the sparse structure of natural language: in a 100,000-token document, only a tiny fraction of token pairs have meaningful relationships.

The O(n²) complexity of attention isn’t a bug to eliminate. It’s the cost of searching for needles in a haystack. Recent theoretical work proves this rigorously: asymptotically accurate sub-quadratic attention is mathematically impossible for arbitrary inputs. [2]

The proof treats attention as an Entropic Optimal Transport problem. Queries are sources distributing attention mass. Keys are targets receiving that mass. Softmax emerges naturally as the solution.
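A toy check of that claim (my own sketch, not the paper’s proof): if we constrain only the query-side marginal, row-wise softmax maximizes the entropy-regularized transport objective, so no other row-stochastic attention pattern scores higher.

import torch

eps = 1.0                                    # entropic regularization strength
Q, K = torch.randn(5, 8), torch.randn(7, 8)  # toy queries and keys
S = Q @ K.T                                  # query-key similarity scores

def transport_objective(P):
    # <P, S> + eps * H(P): the entropic transport objective over row-stochastic plans
    entropy = -(P * torch.log(P + 1e-12)).sum()
    return (P * S).sum() + eps * entropy

P_softmax = torch.softmax(S / eps, dim=-1)   # ordinary attention weights, viewed as a transport plan
for _ in range(100):
    P_other = torch.softmax(torch.randn_like(S), dim=-1)  # some other row-stochastic plan
    assert transport_objective(P_softmax) >= transport_objective(P_other)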

Any sub-quadratic method must leave Ω(n²) query-key pairs uninspected. An adversary can always hide critical information in those uninspected pairs. Perfect compression requires exhaustive search.

But real-world language isn’t adversarial. It lies on a low-dimensional manifold. The comedian doesn’t need to process every possible joke, just the ones that resonate with the structure of human experience.

Practical sparse attention works not because we’ve solved the theoretical problem, but because we’re exploiting the sparsity that nature already provides.
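Here is a minimal illustration of that bet (a generic top-k mask of my own, not any particular paper’s method): keep only the k highest-scoring keys per query and renormalize, assuming the discarded pairs carried negligible mass.

import torch

def topk_sparse_attention(Q, K, V, k=8):
    # Keep only the k most relevant keys per query, assuming real language is
    # benign enough that the discarded pairs carry negligible attention mass.
    # (This sketch still computes every score; practical kernels avoid even that.)
    scores = Q @ K.T / K.shape[-1] ** 0.5
    topk = scores.topk(k, dim=-1)
    masked = torch.full_like(scores, float("-inf"))
    masked.scatter_(-1, topk.indices, topk.values)   # everything outside the top-k -> -inf
    return torch.softmax(masked, dim=-1) @ V

Q, K, V = torch.randn(16, 32), torch.randn(128, 32), torch.randn(128, 32)
out = topk_sparse_attention(Q, K, V)                 # shape (16, 32)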


Hyper-Connections: Learning the Residual Manifold [3]

If self-attention models the sparsity of language, residual connections model the sparsity of computation itself.

The standard residual connection (x + F(x)) assumes a fixed relationship between layers: the identity mapping plus whatever the layer computes. Why should this relationship be fixed?

Hyper-Connections (Zhu et al., ByteDance, ICLR 2025) make it learnable.

The Seesaw Problem

Traditional residual variants create an uncomfortable trade-off:

| Variant | Strength | Weakness |
| --- | --- | --- |
| Pre-Norm | Prevents gradient vanishing | Causes representation collapse; deep layers become redundant |
| Post-Norm | Prevents collapse | Reintroduces gradient vanishing |

The network must choose between stable training and expressive depth.

Learnable Connectivity

Hyper-Connections expand the residual stream into n parallel paths and introduce learnable matrices:

Standard:  x_{l+1} = x_l + F(x_l)

Hyper:     H_{l+1} = H_res · H_l + H_post · F(H_pre · H_l)

Pre-Norm and Post-Norm turn out to be non-trainable special cases of this general framework. Making the matrices learnable lets the network discover its own optimal connectivity.
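A minimal sketch of one layer under this scheme (shapes, initializations, and names are mine, not the paper’s code): the hidden state is widened to n residual streams, and three small learnable maps decide how the streams are read by the layer, written back, and carried forward.

import torch
import torch.nn as nn

class HyperConnectionLayer(nn.Module):
    # One layer with n parallel residual streams; with n_streams=1 and all maps
    # fixed to 1 this collapses back to the standard x + F(x).
    def __init__(self, d_model, n_streams, layer_fn):
        super().__init__()
        self.layer_fn = layer_fn                                # the usual attention or FFN block
        self.H_pre = nn.Parameter(torch.full((1, n_streams), 1.0 / n_streams))  # read: streams -> layer input
        self.H_post = nn.Parameter(torch.ones(n_streams, 1))    # write: layer output -> streams
        self.H_res = nn.Parameter(torch.eye(n_streams))         # carry: stream mixing across depth

    def forward(self, H):                                       # H: (batch, n_streams, d_model)
        x = (self.H_pre @ H).squeeze(1)                         # (batch, d_model)
        y = self.layer_fn(x).unsqueeze(1)                       # (batch, 1, d_model)
        return self.H_res @ H + self.H_post @ y                 # (batch, n_streams, d_model)

layer = HyperConnectionLayer(64, n_streams=4, layer_fn=nn.Linear(64, 64))
H = torch.randn(2, 4, 64)
print(layer(H).shape)                                           # torch.Size([2, 4, 64])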

What Emerges: The Λ-Pattern

Visualization reveals that trained networks learn a characteristic Λ-shaped pattern:

  • Strong connections to recent layers (like Post-Norm)
  • Strong connections to early layers (like Pre-Norm)
  • Weaker connections to middle layers

The network discovers a sparse connectivity structure combining the benefits of both approaches. It finds the minimal set of connections that capture essential information flow.

Engineering Reality

Overhead is negligible: +0.03% parameters, +0.2% FLOPs for a 1B model. On a 7B MoE model, Hyper-Connections achieve 1.8× faster convergence and +6 points on ARC-Challenge.

Manifold-Constrained Hyper-Connections: Stability Through Geometry

DeepSeek took Hyper-Connections further with mHC [4], addressing a flaw that emerges at scale.

Instability at 27B

At 27B parameters, unconstrained Hyper-Connections become catastrophically unstable. The residual mapping matrix H_res, applied across L layers, compounds:

Signal after L layers ∝ (H_res)^L

Without constraints, eigenvalues drift from 1.0. Signals explode when eigenvalues exceed 1, vanish when they fall below. At 27B scale, gradient norms spike to 3000× normal values.
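A quick numerical illustration of why this matters (toy numbers, not the paper’s measurements): a mixing matrix whose eigenvalues drift from 1 amplifies or attenuates exponentially with depth, while the doubly stochastic matrices introduced below cannot grow at all.

import torch

torch.manual_seed(0)
depth = 60
H_free = torch.eye(4) + 0.1 * torch.randn(4, 4)   # unconstrained mixing: eigenvalues drift away from 1
H_ds = torch.full((4, 4), 0.25)                   # doubly stochastic mixing: rows and columns sum to 1

# Compounding across `depth` layers: the free matrix typically explodes or vanishes
# exponentially, while the doubly stochastic one is idempotent here and stays at norm 1.
print(torch.linalg.matrix_power(H_free, depth).norm())
print(torch.linalg.matrix_power(H_ds, depth).norm())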

The Birkhoff Polytope

mHC constrains H_res to be a doubly stochastic matrix: all entries non-negative, all rows and columns summing to 1. This places the matrix on the Birkhoff polytope, a geometric manifold with useful properties.

Multiplying by a doubly stochastic matrix cannot amplify signal magnitude. Products of doubly stochastic matrices remain doubly stochastic. And the space is rich enough for meaningful mixing patterns.

The projection uses the Sinkhorn-Knopp algorithm:

import torch

def sinkhorn_projection(M, iterations=20):
    # Project an unconstrained matrix onto (approximately) the Birkhoff polytope:
    # exponentiate so all entries are positive, then alternately rescale rows and columns.
    M = torch.exp(M)
    for _ in range(iterations):
        M = M / M.sum(dim=1, keepdim=True)  # Row normalization
        M = M / M.sum(dim=0, keepdim=True)  # Column normalization
    return M
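A quick sanity check on the output (my own check, with an arbitrary 4×4 matrix): rows and columns sum to roughly 1, and since a doubly stochastic matrix has spectral norm at most 1, stacking these mixes cannot amplify the residual signal.

H_res = sinkhorn_projection(torch.randn(4, 4))
print(H_res.sum(dim=1))                        # rows ≈ 1
print(H_res.sum(dim=0))                        # columns ≈ 1
print(torch.linalg.matrix_norm(H_res, ord=2))  # ≤ 1, up to the approximation error of 20 iterations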

Engineering for Scale

DeepSeek implemented three optimizations:

Kernel Fusion. Multiple Sinkhorn iterations fused into single CUDA kernels.

Activation Recomputation. Intermediate values recomputed during backward pass, saving 40%+ memory.

DualPipe Communication Overlap. Network I/O hidden behind computation in pipeline-parallel training.
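Of the three, activation recomputation is the easiest to show in plain PyTorch (a generic sketch using torch.utils.checkpoint, not DeepSeek’s fused kernels): the wrapped block’s intermediate activations are dropped during the forward pass and recomputed during backward, trading extra FLOPs for memory.

import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

block = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024))
x = torch.randn(8, 1024, requires_grad=True)

# Intermediate activations inside `block` are not stored; they are recomputed
# when backward() reaches this segment.
y = checkpoint(block, x, use_reentrant=False)
y.sum().backward()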

Result: stable training at 27B with only 6.7% time overhead. The Amax Gain Magnitude drops from ~3000 (unstable HC) to ~1.6 (stable mHC).


Engram: Bottom-Up Static Memory

Hyper-Connections improve how information flows through the network. Engram [5] addresses a different inefficiency: the network wastes dynamic computation on static knowledge.

The Problem

When predicting “Alexander the Great was king of ___”, the model doesn’t need complex reasoning. It needs to retrieve a fact. But Transformers use expensive attention and FFN operations to simulate what should be a simple lookup.

Conditional Memory

Engram introduces conditional memory as a new sparsity axis, complementary to MoE’s conditional computation:

| Sparsity Type | Mechanism | Cost |
| --- | --- | --- |
| Conditional Computation (MoE) | Route to different experts | O(k) per token |
| Conditional Memory (Engram) | Hash N-grams into embedding tables | O(1) per token |

The approach is bottom-up: local N-gram patterns deterministically index into massive static embedding tables. This captures the sparse, compositional structure of language at the lexical level.

Architecture

Input: "Alexander the Great was king of"

N-gram extraction: ["Alexander", "the Great", "was king", ...]

Multi-head hashing: hash_1(ngram), hash_2(ngram), ...

Table lookup: [e_1, e_2, e_3, ...] ← O(1) retrieval

Context-aware gating: gate(hidden_state) * retrieved_embedding

Add to residual stream

Context-aware gating matters here: the current hidden state modulates how much retrieved memory to use, suppressing irrelevant lookups.
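A minimal sketch of the lookup path (names, table sizes, and the hash are mine; the real system uses far larger tables and its own hash design): n-grams are hashed by several independent heads into static embedding tables, and a gate computed from the hidden state decides how much of the retrieved vector enters the residual stream.

import torch
import torch.nn as nn

class EngramLookup(nn.Module):
    # Conditional-memory sketch: deterministic n-gram hashing into static embedding
    # tables, gated by the current hidden state before joining the residual stream.
    def __init__(self, d_model, table_size=2**16, num_heads=4):
        super().__init__()
        self.tables = nn.ModuleList(nn.Embedding(table_size, d_model) for _ in range(num_heads))
        self.gate = nn.Linear(d_model, d_model)
        self.table_size = table_size

    def lookup(self, ngrams):
        # One O(1) retrieval per hashing head; Python's hash() stands in for the real hash functions.
        out = 0
        for head, table in enumerate(self.tables):
            ids = torch.tensor([hash((head, ng)) % self.table_size for ng in ngrams])
            out = out + table(ids)                             # (len(ngrams), d_model)
        return out

    def forward(self, hidden, ngrams):
        gate = torch.sigmoid(self.gate(hidden))                # context-aware gating
        return hidden + gate * self.lookup(ngrams)             # added to the residual stream

engram = EngramLookup(d_model=64)
ngrams = ["Alexander the", "the Great", "Great was", "was king", "king of"]
hidden = torch.randn(len(ngrams), 64)
out = engram(hidden, ngrams)                                   # (5, 64)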

The Sparsity Allocation Problem

There’s an optimal split between MoE and Engram capacity. Experiments reveal a U-shaped curve: pure MoE and pure Engram both underperform a hybrid with ~80% MoE / ~20% Engram allocation.

Effective Deepening

Engram doesn’t just store knowledge. It effectively deepens the network. By handling static pattern reconstruction in early layers, it frees computational depth for dynamic reasoning. Layer 8 of an Engram model is functionally equivalent to layer 12 of an MoE baseline.

Prefetching from Host Memory

Engram lookups are deterministic (based only on input text), so the system can prefetch from CPU RAM while the GPU computes previous layers:

GPU (Layer L-1):  [======== Compute ========]
CPU → GPU:           [--- Prefetch for L ---]
                     ↑ overlapped, zero stall

100B+ parameter tables in CPU RAM with <3% throughput penalty. This bypasses GPU memory limits entirely.
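A generic sketch of that overlap in PyTorch (hypothetical shapes, and it assumes a CUDA device; the real system has its own pipeline): rows for the next layer’s deterministic lookups are copied from pinned host memory on a side stream while the current layer computes, and the compute stream only waits when it actually needs them.

import torch
import torch.nn as nn

device = "cuda"                                           # assumes a CUDA GPU is available
copy_stream = torch.cuda.Stream()

cpu_table = torch.randn(1_000_000, 64)                    # static embedding table kept in host RAM
ids = torch.randint(0, 1_000_000, (4096,))                # deterministic lookups, known from the input text alone

heavy_layer = nn.Sequential(nn.Linear(64, 256), nn.GELU(), nn.Linear(256, 64)).to(device)
x = torch.randn(4096, 64, device=device)

gathered = cpu_table[ids].pin_memory()                    # gather the needed rows into page-locked memory
with torch.cuda.stream(copy_stream):                      # enqueue the host-to-device copy on a side stream
    prefetched = gathered.to(device, non_blocking=True)

y = heavy_layer(x)                                        # default-stream compute overlaps with the copy

torch.cuda.current_stream().wait_stream(copy_stream)      # block only when the embeddings are actually needed
y = y + prefetched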

Results

| Model | MMLU | BBH | HumanEval | MATH |
| --- | --- | --- | --- | --- |
| MoE-27B | 67.2 | 51.3 | 45.1 | 28.4 |
| Engram-27B | 70.6 | 56.3 | 51.8 | 33.7 |

Same parameters, same FLOPs, substantially better compression.


What Does Scaling Really Mean?

When we talk about scaling models, what do we actually want?

The model should become better at understanding the sparsity of natural language. It should find tighter compressions, more precise projections onto the manifold of meaningful text.

But we still don’t have a model that’s a good stand-up comedian.

We have good coders. Good mathematicians. These are domains where correctness is easily verifiable, where the sparse structure is explicit in formal rules. The model can learn to project onto these manifolds because we can tell it exactly when it’s wrong.

Comedy? Creativity? The kind of compression that makes a room full of strangers laugh at the same moment? That requires capturing something about the human condition that we can’t formalize. The sparse structure exists, but we don’t know how to supervise it.

The Pre-Training vs. Mid-Training Debate

This leads to a fundamental strategic question.

If you believe that model-level architecture is the primary driver of capability, pre-training should dominate your training budget.

The logic: a model with better architecture should perform better on the same training data. Invest in better attention mechanisms (sparse, hierarchical, dynamic), better connectivity (Hyper-Connections, mHC), better memory (Engram, retrieval augmentation), better sparsity allocation (MoE + conditional memory hybrids).

The architectural innovations in this post support this view. They achieve better performance with the same compute by improving how the model compresses and projects.

If you believe architectural gains are diminishing and model capability is plateauing, mid-training is the path forward.

This means longer context windows for better global compression. More data, but with higher quality and domain specificity. Task-specific fine-tuning stages. Better data curation and mixture strategies.

The emphasis shifts from how the model compresses to what it compresses.

The Pragmatic View

Both matter. Architecture determines the ceiling of compression quality. Data and compute determine how close you get to that ceiling.

The papers reviewed here suggest the architectural ceiling is still rising, at least from DeepSeek’s perspective. Their innovations share a common theme: exploiting sparsity at multiple levels.

| Level | Innovation | Sparsity Exploited |
| --- | --- | --- |
| Attention | Sparse/dynamic patterns | Token-pair relevance |
| Connectivity | Hyper-Connections, mHC | Layer interaction structure |
| Memory | Engram | Static vs. dynamic knowledge |
| Experts | MoE | Input-dependent computation |

Each finds a different dimension of the low-dimensional manifold that language occupies.

The goal remains the same: build a system that compresses reality as well as a great comedian does. We’re not there. We may be building systems that tell jokes without understanding why they’re funny.

The jokes are getting better, though.


References

  1. Hassabis, D. (2026). Transcript on intelligence, compression, and the manifold hypothesis, https://lexfridman.com/demis-hassabis-2-transcript
  2. Litman, E. (2025). Your Transformer is Secretly an EOT Solver, https://elonlit.com/scrivings/your-transformer-is-secretly-an-eot-solver/
  3. Zhu, D. et al. (2025). Hyper-Connections. ICLR 2025. ByteDance.
  4. Xie, Z. et al. (2026). mHC: Manifold-Constrained Hyper-Connections. arXiv:2512.24880. DeepSeek-AI.
  5. Cheng, X. et al. (2026). Conditional Memory via Scalable Lookup: A New Axis of Sparsity for Large Language Models. arXiv:2601.07372. DeepSeek-AI & Peking University.