I don't know what «domain» or «in-distribution» mean anymore. Obviously LLMs generalize beyond specific examples. Is this literally about latent representations being anchored to specific tokens, kind of like how people internally translate things into the first language they learned?
steve hsu · 10.8. at 20:06
Musk: Steve, the real question I keep asking the team is whether today's LLMs can reason when they leave the training distribution. Everyone cites chain-of-thought prompts, but that could just be mimicry.

Hsu: Agreed. The latest benchmarks show that even Grok-4-level models degrade sharply once you force a domain shift; the latent space just doesn't span the new modality.

Musk: So it's more of a coverage problem than a reasoning failure?

Hsu: Partly. But there's a deeper issue. The transformer's only built-in inductive bias is associative pattern matching. When the prompt is truly out-of-distribution (say, a symbolic puzzle whose tokens never co-occurred in training), the model has no structural prior to fall back on. It literally flips coins.

Musk: Yet we see emergent "grokking" on synthetic tasks. Zhong et al. showed that induction heads can compose rules they were never explicitly trained on. Doesn't that look like reasoning?

Hsu: Composition buys you limited generalization, but the rules still have to lie in the span of the training grammar. As soon as you tweak the semantics, say by changing a single operator in the puzzle, accuracy collapses. That's not robust reasoning; it's brittle interpolation.

Musk: Couldn't reinforcement learning fix it? DRG-Sapphire used GRPO on top of a 7B base model and got physician-grade coding on clinical notes, a classic OOD task.

Hsu: The catch is that RL only works after the base model has ingested enough domain knowledge via supervised fine-tuning. When the pre-training corpus is sparse, RL alone plateaus. So the "reasoning" is still parasitic on prior knowledge density.

Musk: So your takeaway is that scaling data and parameters won't solve the problem? We'll always hit a wall where the next OOD domain breaks the model?

Hsu: Not necessarily a wall, but a ceiling. The empirical curves suggest that generalization error decays roughly logarithmically with the number of training examples, which implies you need exponentially more data for each new tail distribution. For narrow verticals, say rocket-engine diagnostics, it's cheaper to bake in symbolic priors than to scale blindly.

Musk: Which brings us back to neuro-symbolic hybrids. Give the LLM access to a small verified solver, then let it orchestrate calls when the distribution shifts.

Hsu: Exactly. The LLM becomes a meta-controller that recognizes when it's OOD and hands off to a specialized module. That architecture sidesteps the "one giant transformer" fallacy.

Musk: All right, I'll tell the xAI team to stop chasing the next trillion tokens and start building the routing layer. Thanks, Steve.

Hsu: Anytime. And if you need synthetic OOD test cases, my lab has a generator that's already fooled GPT-5. I'll send the repo.

This conversation with Elon might be AI-generated.
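A few sketches to make the technical claims in the dialogue concrete. First, the "change a single operator" failure mode: below is a minimal, purely illustrative probe (not the generator Hsu mentions, which isn't linked here) that pairs an ordinary arithmetic question with a variant whose stated semantics are shifted, so a model that only pattern-matches familiar token co-occurrences answers the perturbed version incorrectly.

```python
import random

# Minimal probe for the "change a single operator" failure mode from the
# dialogue: the perturbed variant states a new rule ('+' now means subtract)
# and checks whether a model follows the stated rule or the memorized one.
# Purely illustrative; this is NOT the generator Hsu mentions at the end.

def make_pair(rng: random.Random):
    a, b = rng.randint(2, 99), rng.randint(2, 99)
    standard = (f"What is {a} + {b}?", a + b)  # in-distribution semantics
    perturbed = (f"In this puzzle, '+' means subtraction. What is {a} + {b}?", a - b)
    return standard, perturbed

rng = random.Random(0)
for standard, perturbed in (make_pair(rng) for _ in range(3)):
    print("in-distribution:", standard)
    print("semantics shift:", perturbed)
```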
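Second, the scaling argument. If generalization error really does fall off roughly logarithmically in the number of training examples N, i.e. err(N) ≈ a − b·ln N, then inverting the curve gives N ≈ exp((a − err)/b): each fixed reduction in error multiplies the data requirement by a constant factor. The constants below are made up purely for illustration.

```python
import math

# Hypothetical log-decay curve err(N) = a - b * ln(N); the constants a and b
# are illustrative only, not fit to any real model or benchmark.
a, b = 1.0, 0.08

def examples_needed(target_err: float) -> float:
    """Invert err(N) = target to see how many training examples the curve demands."""
    return math.exp((a - target_err) / b)

for target in (0.40, 0.35, 0.30, 0.25):
    print(f"target error {target:.2f} -> ~{examples_needed(target):.2e} examples")

# Each extra 0.05 of error reduction multiplies the required data by
# exp(0.05 / b) ~= 1.9x here: linear gains demand exponentially more data.
```

With these toy constants, every additional 0.05 of error reduction costs roughly 1.9x more data, which is the "exponentially more data for each new tail distribution" point.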
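Finally, the routing layer Musk and Hsu land on. Here is a minimal sketch of the handoff pattern under obvious assumptions: the LLM client, the verified solver, and the OOD scorer are hypothetical stand-ins, and nothing here describes an actual xAI or Grok architecture. In Hsu's framing the LLM itself recognizes the shift; in this sketch a separate scoring function stands in for that judgment to keep the example self-contained.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical stand-ins: a real system would wrap an actual LLM client, a
# verified symbolic solver, and a learned OOD detector. This only illustrates
# the meta-controller handoff pattern discussed in the dialogue.
LLM = Callable[[str], str]
Solver = Callable[[str], str]

@dataclass
class MetaController:
    """Route a query to the LLM or to a specialized solver based on an OOD score."""
    llm: LLM
    solver: Solver
    ood_score: Callable[[str], float]  # e.g. perplexity or embedding distance, scaled to [0, 1]
    threshold: float = 0.8

    def answer(self, query: str) -> str:
        if self.ood_score(query) > self.threshold:
            # Suspected distribution shift: hand off to the verified module.
            return self.solver(query)
        # Looks in-distribution: let the LLM answer directly.
        return self.llm(query)

# Toy usage with stub components:
controller = MetaController(
    llm=lambda q: f"[LLM] {q}",
    solver=lambda q: f"[solver] {q}",
    ood_score=lambda q: 0.9 if any(ch.isdigit() for ch in q) else 0.1,  # crude stand-in detector
)
print(controller.answer("Summarize the memo from yesterday."))
print(controller.answer("Solve 37 * 41 - 12."))
```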