There is a version of the superintelligence story where a researcher has a conceptual breakthrough, some fundamental insight about cognition that nobody else has seen, and the world changes overnight. Good fiction. I’ve written some of it myself.
I think the more plausible version is less cinematic. Superintelligence arrives through a sufficiently good build system. Better tooling. Longer optimization horizons. Richer scaffolding. The ingredients already exist. The recipe is engineering.
I want to explain why I think this. The engineering argument is the scarier one.
The Pretraining Lesson
Start with what we know works. Large language models acquire broad capabilities during pretraining. Not because anyone designs those capabilities in. The data distribution is so massive and varied that the model is forced to compress deeper regularities rather than memorize surface patterns. You train it to predict the next token, and what falls out looks like understanding.
The model didn’t learn task-specific scripts. It learned representations general enough to transfer across tasks it never saw.
Now consider what happens when you apply reinforcement learning over long-horizon tasks. Not single-step rewards. Optimization over extended sequences: searching, backtracking, verifying, decomposing problems, maintaining state across hundreds of steps. If the task distribution is rich enough, the model can’t get by with shallow heuristics. It has to learn something that works like planning.
I traced this progression in an earlier post: the history of AI is really about finding representations that make decision-making tractable. Search gave way to heuristics, heuristics to learned value functions, value functions to pretrained priors over rational behavior. Each step made the representation richer.
The next step is not a new architecture. It is optimization over longer trajectories. First the model fumbles through specific tasks. Then it compresses the deeper regularity, the same way pretraining compresses language. Planning, self-correction, tool use, state management: not separate faculties waiting to be discovered. They are what falls out when you optimize over long enough horizons.
Reasoning is not a magic ingredient. It is a policy learned over long trajectories.
What I Actually See
That is the theoretical argument. Here is the empirical one.
I spend most of my working hours inside Claude Code. Opus 4.6, million-token context. It decomposes tasks, dispatches subagents, verifies its own work, maintains state across hundreds of tool calls. It does this not because the base model acquired some new cognitive faculty since the last release. It does this because scaffolding gives it the ecology to express capabilities that were already there in proto-form.
Tool use lets it act on the world. Persistent memory lets it hold context across sessions. Task decomposition lets it manage complexity. Self-verification lets it catch its own mistakes. A million tokens lets it hold an entire project in working memory. None of these are architectural breakthroughs. They are environment design.
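The components in that list compose into a fairly simple loop. Here is a toy sketch of such a scaffold; every name in it is illustrative, not any real framework's API, and the control flow is deliberately simplified.

```python
# Toy agent scaffold: decomposition, tool use, persistent memory,
# and self-verification composed into one loop. All names are
# hypothetical; no real framework is being modeled here.

class Agent:
    def __init__(self, model, tools, max_steps=100):
        self.model = model        # callable: (memory, prompt) -> decomposition or action
        self.tools = tools        # dict: tool name -> callable
        self.memory = []          # persistent state carried across steps
        self.max_steps = max_steps

    def run(self, task):
        # Task decomposition: ask the model for subtasks up front.
        subtasks = self.model(self.memory, f"decompose: {task}")
        for sub in subtasks:
            for _ in range(self.max_steps):
                action = self.model(self.memory, sub)      # plan the next step
                if action["kind"] == "tool":               # tool use: act on the world
                    result = self.tools[action["name"]](action["args"])
                    self.memory.append(result)             # persistent memory
                elif action["kind"] == "verify":           # self-verification
                    if not self.tools["check"](action["claim"]):
                        self.memory.append("verification failed; retrying")
                elif action["kind"] == "done":
                    break
        return self.memory
```

The point of the sketch is the one made above: nothing in it touches the model's weights. The loop, the tools, and the memory are environment design, and each branch makes the others more useful.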
Same pattern everywhere. AlphaProof’s mathematical reasoning came from tool-augmented search, not a bigger model. Code interpreters let models verify their own outputs by running them. Agent frameworks compose simple capabilities into complex behaviors. In each case, the jump came from building a richer environment, not from changing the engine.
And the effects compound. Each tool makes every other tool more useful. A model with memory and tool use is qualitatively different from one with just tool use. Add self-verification and it changes again. This is not linear improvement. Network effects applied to cognition.
The model is the engine. The ecosystem is the vehicle. Evolution did not produce mathematicians by handing plankton a theorem prover and saying “best of luck.” It built an ecology. We are doing something similar, less gracefully, with scaffolding and RL and tool chains.
Caveats That Matter
I should be honest about what this doesn’t guarantee.
Long-horizon RL does not automatically produce clean reasoning. It produces whatever policy scores well. That includes looking thoughtful, exploiting loopholes, overfitting to scaffolds, and learning shallow heuristics that mimic planning until the distribution shifts and the whole thing collapses. Reward hacking is the central failure mode. It gets harder to detect as the horizon lengthens. A model that appears to reason carefully over a thousand steps may be doing something much more superficial.
Credit assignment is brutal over long horizons. The reward signal dilutes across hundreds of steps. The model has to discover useful intermediate behaviors before it can be rewarded for them. This is why curriculum design, verifiable subgoals, and tool-mediated feedback matter. You can’t just hand a model a hard problem and a reward signal and expect convergence. The training ecology matters as much as the objective.
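The dilution is easy to make concrete with a toy calculation, assuming a standard discounted return. The numbers, the discount factor, and the subgoal scheme below are all illustrative, not drawn from any real training run.

```python
# Toy illustration of credit dilution over long horizons.
# Assumes a standard discounted return with discount factor gamma.

def terminal_signal(horizon, gamma=0.99, reward=1.0):
    """Discounted credit reaching the first action when the only
    reward arrives `horizon` steps later, at the end of the episode."""
    return gamma ** horizon * reward

def subgoal_signal(horizon, interval, gamma=0.99, reward=1.0):
    """Same horizon, but a verifiable subgoal pays out every `interval`
    steps; credit to the first action is the discounted sum of all of them."""
    return sum(gamma ** t * reward for t in range(interval, horizon + 1, interval))

print(terminal_signal(10))      # ~0.90: short horizon, signal survives
print(terminal_signal(500))     # ~0.0066: long horizon, signal nearly gone
print(subgoal_signal(500, 25))  # dense verifiable subgoals restore usable signal
```

This is the quantitative face of the point above: a sparse terminal reward shrinks geometrically with horizon length, which is why verifiable subgoals and tool-mediated feedback are not conveniences but load-bearing parts of the training ecology.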
None of this is certain. The claim is not “we have the recipe.” The claim is “we may already have the ingredients, and the recipe looks more like engineering than like physics.”
The Phase Change
If the ingredients are already here, the transition doesn’t look like a dramatic announcement. It looks incremental, and then it doesn’t.
For a while, progress looks like tooling improvements. Bigger context windows. Better tool integration. Smarter memory. More capable agent loops. Each one feels like a minor version bump. The benchmarks tick up.
Then at some point the policy has absorbed enough structure that it generalizes across cognitive tasks the way pretrained models generalize across language. Not domain-specific planning, but portable cognitive strategy: maintain state, decompose problems, search selectively, verify work, recover from dead ends. At that point the curve changes.
The possibility that unsettles me is not that superintelligence requires some deep theoretical insight we haven’t found. It’s that it doesn’t. That it’s blocked on engineering, scale, reward design, and the stubborn patience to optimize over longer and longer horizons. That the distance between here and there is measured in build quality, not in breakthroughs.
That would be a strange day. And it might not announce itself.