Science as Verifiable Search

What is science?

Generate hypotheses. Test them against observation. Revise or discard based on results. Repeat.

This is search. But the hypothesis space is vast—too complex for brute force. You need intelligent pruning.

This isn’t mystical. Occam’s razor tells us to prefer simpler explanations, and Solomonoff induction formalizes why: in the space of all possible hypotheses, shorter descriptions (simpler theories) should get higher prior probability. The bias toward simplicity isn’t arbitrary—it’s the theoretically optimal way to navigate an infinite hypothesis space. Intelligence guides the search by embodying this prior, using intuition, analogy, and prior knowledge to focus on hypotheses that are likely to be true.
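
A toy sketch of that prior in code, with the description lengths and likelihoods made up purely for illustration:

```python
# Toy sketch of the simplicity prior: each hypothesis gets prior mass ~ 2^(-description length),
# and observations reweight the candidates; a falsified hypothesis drops to zero.

def occam_prior(description_length):
    # Shorter hypotheses get exponentially more prior mass.
    return 2.0 ** (-description_length)

def posterior(hypotheses, observations):
    # hypotheses: list of (name, description_length, likelihood_fn),
    # where likelihood_fn(obs) is the probability of one observation under that hypothesis.
    weights = {}
    for name, length, likelihood in hypotheses:
        w = occam_prior(length)
        for obs in observations:
            w *= likelihood(obs)        # goes to zero if the hypothesis is falsified
        weights[name] = w
    total = sum(weights.values()) or 1.0
    return {name: w / total for name, w in weights.items()}

# Is the coin rigged, fair, or merely biased?
hypotheses = [
    ("always-heads", 2, lambda obs: 1.0 if obs == "H" else 0.0),
    ("fair-coin",    5, lambda obs: 0.5),
    ("biased-0.9",   9, lambda obs: 0.9 if obs == "H" else 0.1),
]
print(posterior(hypotheses, observations=["H", "H", "H", "T"]))
# "always-heads" is falsified by the single tail; the simpler "fair-coin" beats "biased-0.9".
```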

But intelligence alone isn’t enough. You can’t reason your way to truth about the world without touching the world. Empirical testing provides the signal. Not “verification” exactly—science doesn’t prove things true. It fails to disprove them. Observations must be compatible with the theory, and theories that survive repeated attempts at falsification gain credence.

Intelligence and empirical testing are complementary. Intelligence prunes the space—you can’t test everything. Testing provides the signal—you can’t think your way to truth. Both essential, neither sufficient.

This is the scientific method. It’s domain-independent. What varies is the cost of testing.


The Testing Bottleneck

In software, tests run in milliseconds. In materials simulation, seconds to hours. In biology, months. In climate science, decades. In medicine, years and millions of dollars.

This cost structure determines where progress happens fastest. Not because some fields are “easier”—but because the testing loop is tighter. Faster feedback means faster iteration. Faster iteration means faster convergence on truth.

Current AI systems can already participate in the scientific loop. They can generate hypotheses, design experiments, analyze results, revise and iterate. What limits them is the same thing that limits humans: how fast and cheaply can you test?

In domains with tight feedback loops—code, simulations, games—AI systems already accelerate discovery. In domains where testing is slow or expensive, they hit the same wall we do.


Synthetic Experiments

Here’s the key move: build synthetic worlds.

Create a data-generating process where we know the underlying laws, but the AI does not. The system sees only probes, measurements, noise, constraints—the same interface a scientist has with reality. From its perspective, it’s doing real science. It’s searching through hypothesis space, testing against evidence, revising.

But we can generate observations cheaply, because we control the ground truth.

The process is identical to what humans do. It’s just faster.
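
A minimal sketch of what such a world can look like, here assuming nothing richer than a hidden linear law behind a noisy probe (all names and numbers illustrative):

```python
# Minimal sketch of a synthetic world: the generator knows the hidden law,
# the "scientist" only ever sees a noisy probe interface.

import random

class HiddenWorld:
    def __init__(self, seed=0):
        rng = random.Random(seed)
        # Ground truth, known to the generator but never exposed to the agent:
        # a linear law y = a*x + b observed through Gaussian noise.
        self._a = rng.uniform(-3, 3)
        self._b = rng.uniform(-3, 3)
        self._noise = 0.1

    def probe(self, x):
        # The only interface the agent gets: choose an input, receive a noisy measurement.
        return self._a * x + self._b + random.gauss(0, self._noise)

    def score(self, a_hat, b_hat):
        # Cheap verification, because we control the ground truth.
        return abs(a_hat - self._a) + abs(b_hat - self._b)

world = HiddenWorld(seed=42)
(x0, y0), (x1, y1) = [(x, world.probe(x)) for x in (0.0, 1.0)]
a_hat = (y1 - y0) / (x1 - x0)           # the agent's two-point guess at the law
b_hat = y0 - a_hat * x0
print(world.score(a_hat, b_hat))        # feedback in microseconds, not months
```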

This isn’t speculative. It generalizes from things that already work:

AlphaZero didn’t train on human chess games. It played itself, in a world with perfect rules and instant feedback. The process was genuine—exploration, evaluation, improvement—just accelerated. By the time it finished, it had rediscovered and surpassed centuries of human chess understanding.

Bug injection in code works the same way: introduce known defects into working systems, train models to find them. Ground-truth labels at scale. The model learns to find bugs, not memorize specific patches.
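
A toy version of that recipe, using crude string-level mutations (the function and mutation table are made up for illustration):

```python
# Toy bug injection: mutate a known-good function and keep the mutation as the label.

import random

CORRECT_SOURCE = """
def count_positive(xs):
    total = 0
    for x in xs:
        if x > 0:
            total += 1
    return total
"""

MUTATIONS = [
    ("x > 0", "x >= 0"),            # off-by-one boundary bug
    ("total += 1", "total += 2"),   # wrong increment
    ("return total", "return total + 1"),
]

def inject_bug(source, rng):
    old, new = rng.choice(MUTATIONS)
    buggy = source.replace(old, new, 1)
    return buggy, {"was": old, "now": new}   # (training input, ground-truth label)

buggy_source, label = inject_bug(CORRECT_SOURCE, random.Random(0))
# A model trained on many such pairs learns to localize defects, and every answer
# can be checked automatically against the label.
```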

Same principle: train on unknown worlds where discovery is the task.

Why does this work?

First, dense, cheap observations. Feedback loops that take years in the real world collapse to milliseconds. You can run millions of hypothesis-test cycles in the time it takes to run one wet lab experiment.

Second, procedural generation avoids memorization. Generate entire families of worlds: different causal graphs, different latent variables, different noise structures, different intervention affordances. The system can’t memorize specific answers. It has to learn strategies for inquiry.
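
A sketch of what that generation could look like, where each seed yields a different random causal graph with its own parameters and noise (the specific structure here is just an illustrative choice):

```python
# Sketch: procedurally generate a family of worlds so answers can't be memorized.
# Each seed varies the causal structure, the latent parameters, and the noise.

import random

def sample_world(seed):
    rng = random.Random(seed)
    num_vars = rng.randint(2, 5)
    # Random causal graph: each variable may depend on earlier ones.
    parents = {i: [j for j in range(i) if rng.random() < 0.5] for i in range(num_vars)}
    weights = {i: {j: rng.uniform(-2, 2) for j in parents[i]} for i in range(num_vars)}
    noise = rng.uniform(0.01, 0.5)

    def observe(interventions=None):
        # interventions: optional {variable: value} the agent clamps (its experiment).
        interventions = interventions or {}
        values = {}
        for i in range(num_vars):
            if i in interventions:
                values[i] = interventions[i]
            else:
                values[i] = sum(weights[i][j] * values[j] for j in parents[i])
                values[i] += random.gauss(0, noise)
        return values

    return observe

# Every seed is a different world; only the strategy of inquiry transfers.
observe = sample_world(seed=7)
print(observe({0: 1.0}))   # intervene on variable 0, read out the rest
```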

Third—and this is the crucial part—you’re training process, not facts.

The system learns: how to choose informative experiments. How to update beliefs efficiently given evidence. How to detect when assumptions break. How to trade off exploration versus exploitation.

These are transferable inquiry strategies. They’re not about knowing the right equation. They’re about knowing how to find equations.

You could call these unit tests for discovery.
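
The first of those, choosing informative experiments, has a standard formalization: pick the experiment with the highest expected information gain over your current beliefs. A toy sketch, assuming a binary outcome space for simplicity:

```python
# Sketch: score a candidate experiment by how much it is expected to
# reduce entropy over the surviving hypotheses.

import math

def entropy(weights):
    total = sum(weights.values())
    probs = [w / total for w in weights.values() if w > 0]
    return -sum(p * math.log2(p) for p in probs)

def expected_information_gain(weights, experiment, likelihood, outcomes=("success", "failure")):
    # likelihood(hypothesis, experiment, outcome) -> P(outcome | hypothesis, experiment)
    total = sum(weights.values())
    prior_entropy = entropy(weights)
    gain = 0.0
    for outcome in outcomes:
        # Predictive probability of this outcome, averaged over current beliefs.
        p_outcome = sum((w / total) * likelihood(h, experiment, outcome)
                        for h, w in weights.items())
        if p_outcome == 0:
            continue
        # Beliefs we would hold if this outcome were observed.
        post = {h: w * likelihood(h, experiment, outcome) for h, w in weights.items()}
        gain += p_outcome * (prior_entropy - entropy(post))
    return gain

# Toy demo: does flipping the switch control the light?
weights = {"switch-controls-light": 1.0, "switch-does-nothing": 1.0}

def likelihood(h, experiment, outcome):
    if h == "switch-controls-light":
        return 1.0 if outcome == "success" else 0.0
    return 0.5  # under the null, the light flickers randomly either way

print(expected_information_gain(weights, "flip the switch", likelihood))  # ~0.31 bits
```

The agent runs this over its menu of feasible experiments and performs the one with the highest expected gain.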


The Traps

A few reasons to stay skeptical:

Synthetic worlds are biased. If your data-generating processes are too clean—perfect measurements, no confounders, no missing data, no cost constraints—you train on “science in heaven.” Real science is messier. Real instruments drift and break. Real data has gaps and errors. You’d need to inject friction deliberately: partial observability, measurement error, resource constraints, adversarial correlations that mimic the actual texture of empirical work.
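
Concretely, that friction can be layered onto a clean simulator as a wrapper; a sketch, with the particular failure modes and parameters picked for illustration (the clean_probe here would be something like the probe interface sketched earlier):

```python
# Sketch: wrap a clean probe with the frictions of real empirical work:
# drifting instruments, dropped readings, and a finite experiment budget.

import random

class FrictionWrapper:
    def __init__(self, clean_probe, budget=100, drift_per_step=0.01,
                 dropout_rate=0.05, noise_sd=0.2):
        self._probe = clean_probe
        self._budget = budget
        self._drift = 0.0
        self._drift_per_step = drift_per_step
        self._dropout_rate = dropout_rate
        self._noise_sd = noise_sd

    def probe(self, x):
        if self._budget <= 0:
            raise RuntimeError("out of experimental budget")
        self._budget -= 1
        self._drift += self._drift_per_step       # the instrument drifts over time
        if random.random() < self._dropout_rate:  # and sometimes it simply fails
            return None
        return self._probe(x) + self._drift + random.gauss(0, self._noise_sd)

# Usage (with the HiddenWorld sketched above): noisy = FrictionWrapper(world.probe)
```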

The distribution gap is real. An agent that excels at discovering laws in simulated physics may struggle in wet biology. Transfer is not guaranteed. The invariants of inquiry—experiment design, hypothesis refinement, anomaly detection—may or may not carry across domains. We don’t know yet.

Novelty is the hard part. Optimization is easy to reward. Extension of known frameworks is easy to reward. True conceptual reframing—the kind that redraws the map rather than filling it in—is rare, hard to define, and therefore hard to train. Systems built this way might get very good at fast exploration without ever producing a genuine paradigm shift.

But note: humans train on toy problems too. Physics students learn on frictionless planes and spherical cows. Medical students learn on simulations before touching patients. The question is empirical: do inquiry strategies learned in synthetic worlds transfer to real ones? We could answer it.


What Changes

If this works even moderately well, the bottleneck shifts.

It shifts from data to world design. The question becomes: have we built training worlds that capture the hardness of real discovery? The limiting factor isn’t “do we have enough examples” but “have we specified the problem faithfully.”

It shifts from individual insight to what we choose to measure. Progress stops being limited by one brilliant person having a breakthrough. It starts being limited by whether we’ve specified objectives that don’t reward gaming.

There’s something unsettling here. Humans are notoriously bad at noticing when the map has quietly changed. If AI systems generate results that are locally valid—each prediction checks out, each experiment replicates—but globally reorienting, we may accept them piecemeal. The conceptual shift might happen gradually, almost bureaucratically, as a thousand small correct predictions accumulate into a new understanding that no one explicitly authored.

For now, humans still frame problems, choose objectives, and judge outputs. How long that remains true is an open question. But the process itself—search guided by intelligence, tested against evidence—doesn’t require human operators. AI systems can do exactly what we do. Synthetic experiments just accelerate the testing loop.


Once systems are trained on tightening the hypothesis-experiment-revision loop, what limits progress?

Not intelligence—that’s part of the loop. Not data in the traditional sense.

What limits progress is whether we can build training worlds that faithfully represent the hardness of real discovery. Whether the inquiry strategies learned there transfer to the domains we care about. Whether we can reward the right things.

A few questions I don’t have answers to:

  • What domains are “simulation-complete” enough for this to work? Where does the synthetic-real gap stay small enough for transfer?
  • What invariants of inquiry actually generalize? Can you learn “how to be a good scientist” in abstraction, or is it always domain-specific?
  • If novelty is hard to reward, can we train for it at all? Or do we just accelerate search and hope novelty emerges as a side effect of covering more ground?
  • How do we notice when the map has quietly redrawn itself?

Science is search. The loop is tightening. The question is what we’re searching for.
