Does tree search help LLM reasoning? The literature can’t decide.
ReST-MCTS* says yes. AB-MCTS got a NeurIPS spotlight. “Limits of PRM-Guided Tree Search” says no: MCTS with a process reward model used 11x more tokens than best-of-N for zero accuracy gain. Snell et al. found beam search degrades performance on easy problems.
I built mcts-reasoning and ran controlled experiments to find where the boundary is. Total API cost: $0.48.
Setup
Four methods, same budget. Eight solution attempts per problem:
- Pass@1: One shot.
- Best-of-N: 8 independent solutions, verifier scores each, pick the best.
- Self-consistency: 8 solutions, majority vote.
- MCTS: 5 initial solutions scored by the verifier, then 3 more guided by UCB1, informed by what worked and what failed.
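The UCB1 step above balances exploiting high-scoring branches against exploring undervisited ones. A minimal sketch of that selection rule (the node representation and exploration constant are illustrative assumptions, not mcts-reasoning's actual API):

```python
import math

def ucb1(mean_score, visits, total_visits, c=1.414):
    """UCB1 value: exploitation term plus an exploration bonus."""
    if visits == 0:
        return float("inf")  # unvisited branches are always tried first
    return mean_score + c * math.sqrt(math.log(total_visits) / visits)

def select(nodes):
    """Pick the branch with the highest UCB1 value."""
    total = sum(n["visits"] for n in nodes)
    return max(nodes, key=lambda n: ucb1(n["score"], n["visits"], total))

# Three partial solutions: a well-visited high scorer, a barely-visited
# mid scorer, and a weak branch.
nodes = [
    {"id": "a", "score": 0.83, "visits": 5},
    {"id": "b", "score": 0.75, "visits": 1},
    {"id": "c", "score": 0.50, "visits": 2},
]
print(select(nodes)["id"])  # "b": the exploration bonus outweighs
                            # its lower mean score
```

Note that the barely-visited branch wins despite a lower mean; that is the diversity pressure discussed later.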
Model: Claude Haiku 4.5. Problems: constraint satisfaction. Find integer values for variables satisfying simultaneous constraints. Example:
Find A, B, C, D, E, F satisfying ALL constraints:
1. A * D = 21
2. C = A + F
3. B * F = 20
4. E = A + 2
5. D + B = A
6. E - F = B
7. A mod 7 = 0
8. C mod 4 = 0
9. B * D = 12
10. C > E > A > F > B > D
11. A + B + C + D + E + F = 40
12. E * D = 27
The verifier is a plain Python function: it checks each constraint and returns the fraction satisfied. No LLM in the loop. Deterministic.
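For the example problem above, such a verifier fits in a dozen lines (a sketch matching the description, not necessarily the repo's exact code):

```python
def score(A, B, C, D, E, F):
    """Fraction of the 12 constraints satisfied. Deterministic, no LLM."""
    checks = [
        A * D == 21,
        C == A + F,
        B * F == 20,
        E == A + 2,
        D + B == A,
        E - F == B,
        A % 7 == 0,
        C % 4 == 0,
        B * D == 12,
        C > E > A > F > B > D,
        A + B + C + D + E + F == 40,
        E * D == 27,
    ]
    return sum(checks) / len(checks)

print(score(7, 4, 12, 3, 9, 5))  # 1.0: all 12 constraints hold
print(score(7, 4, 12, 3, 9, 6))  # a near-miss still earns partial credit
```

The second call returns 8/12: the graded score is what gives the search a direction, as discussed below.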
Calibration
Easy problems first (3-5 variables, 5-9 constraints). Haiku solved them in one pass. All methods tied at 100%.
5-variable problems with 9 constraints: Pass@1 dropped to 65%. Self-consistency failed one problem. But best-of-N still tied MCTS: with 8 independent samples, at least one is usually correct, and best-of-N just picks it.
I needed problems where blind sampling hits a ceiling.
Results
Ten harder problems: 6-8 variables, 12-15 constraints. Products, modular arithmetic, ordering chains, cascading dependencies. Pass@1 dropped to 29%.
| Method | Solve Rate | Avg Score |
|---|---|---|
| Pass@1 | 29% | 0.29 |
| Pass@8 oracle | 90% | 0.90 |
| SC@8 | 90% | 0.90 |
| BestOfN@8 | 90% | 0.90 |
| MCTS(5+3) | 100% | 1.00 |
MCTS solved all 10. Every other method failed on one problem (v6_3).
v6_3 is a 6-variable, 12-constraint problem where none of 8 independent samples found the correct solution. Pass@8 oracle: 0/8. Self-consistency picks the most popular wrong answer. Best-of-N picks the best wrong answer. Both fail.
MCTS sees that initial attempts satisfied 10/12 constraints but violated specific ones. UCB1 selects the most promising partial solution. The next attempt, informed by the failure pattern, satisfies all 12.
Total: $0.48. 180 API calls, about 190K tokens.
When MCTS Helps
The pattern across three rounds of experiments:
Easy problems (Pass@1 > 80%): No advantage. The model solves them.
Medium (Pass@1 40-70%): MCTS ties BestOfN. Blind sampling usually contains a correct solution. The verifier selects it.
Hard (Pass@1 < 30%): MCTS pulls ahead. When Pass@8 oracle is low, blind sampling can’t find the answer. MCTS’s informed exploration does.
The condition: MCTS adds value when independent sampling hits a ceiling and the verifier provides a gradient.
The gradient part matters. A binary pass/fail verifier says “wrong” but not how wrong. Partial credit (constraints satisfied / total) gives MCTS something to work with. The exploration phase sees what’s close and adjusts.
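Concretely, take two hypothetical candidates violating 5 and 1 of 12 constraints:

```python
candidates = {"far": 5, "close": 1}   # name -> constraints violated
TOTAL = 12

# Binary verifier: pass/fail only.
binary = {name: v == 0 for name, v in candidates.items()}

# Graded verifier: fraction of constraints satisfied.
graded = {name: (TOTAL - v) / TOTAL for name, v in candidates.items()}

print(binary)  # both False: the search can't tell which branch to expand
print(graded)  # 'close' scores 11/12 vs 7/12: expand 'close'
```

Under binary scoring both branches look identical; under graded scoring the near-miss dominates, which is exactly the signal UCB1 exploits.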
Why Self-Consistency Can’t Help
Self-consistency and UCB1 have a structural conflict.
Self-consistency rewards consensus. UCB1 rewards diversity: it explores undervisited branches precisely because they’re undervisited. Using self-consistency as a reward signal inside MCTS tells the tree to explore and converge at the same time. The exploration term pushes toward novel solutions. The consistency reward penalizes them.
On v6_3, all 8 samples failed. SC selected the most common failure mode. A per-path verifier doesn’t have this problem. Each solution is scored against the constraints independently. Good solutions propagate through the tree regardless of what other branches found.
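A toy version of the v6_3 failure makes the conflict explicit (answer strings and verifier scores here are placeholders): seven samples converge on the same wrong answer, one finds the right one.

```python
from collections import Counter

answers = ["wrong"] * 7 + ["right"]
counts = Counter(answers)

# Self-consistency reward: fraction of samples that agree with you.
sc_reward = {a: counts[a] / len(answers) for a in counts}
# The novel correct branch is penalized precisely because it
# disagrees with the consensus: 0.125 vs 0.875.

# Per-path verifier reward: each answer checked on its own merits
# (stand-in scores; a real run would call the constraint checker).
verify = lambda a: 1.0 if a == "right" else 0.6
path_reward = {a: verify(a) for a in counts}
```

Consensus scoring ranks the wrong branch first; per-path verification ranks the correct one first, regardless of what the other branches found.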
I haven’t seen this conflict discussed in the literature. Most prior MCTS-for-LLM work uses per-path evaluation without explaining why self-consistency is absent.
What the Literature Says
These results fit a pattern.
Snell et al. (2024): compute-optimal test-time scaling needs difficulty-adaptive allocation. Easy problems need no search. Hard problems need search plus good verifiers.
“Limits of PRM-Guided Tree Search” (2025): PRM-guided MCTS fails to beat best-of-N because PRM quality degrades with depth. Noisy reward, no benefit from search.
“Don’t Get Lost in the Trees” (ACL 2025): verifier variance causes search pathologies. Deterministic verifiers avoid this.
Chen et al. (2024): 90% discriminator accuracy threshold for tree search to beat reranking. Deterministic constraint checkers hit 100%.
MCTS was built for games with perfect information. Chess and Go have deterministic reward signals. When the reward is noisy, the search can’t exploit it.
The Verification Asymmetry
The problems where MCTS helps share a structure: easy to verify, hard to solve.
A constraint satisfaction problem with 8 variables and 15 constraints is hard to solve. The LLM has to coordinate assignments across all variables simultaneously. Checking a proposed solution is trivial: evaluate each constraint, count violations.
This is the asymmetry that makes NP problems interesting. Checking a certificate is polynomial. Finding one is (presumably) not. It’s why search works for code generation (run the tests) and proof checking (verify the steps) but not for open-ended essay writing (no verifier).
The same asymmetry shows up at other levels. Reverse-Process Synthetic Data Generation exploits it for training data: run the easy direction (differentiation) to get solved examples of the hard direction (integration). Science as Verifiable Search is the same observation about scientific method: science is search through hypothesis space, and the bottleneck is the cost of testing. Cheap verification enables fast iteration.
At training time, verifiable rewards let you RL a model into producing better reasoning (DeepSeek-R1, GRPO). At inference time, verifiable rewards let you search over candidate solutions (MCTS, best-of-N). At the level of scientific discovery, verifiable predictions let you prune hypothesis space. Sutton’s “Reward is Enough” is the abstract version of this.
The practical question for LLM reasoning: can you write a verifier? If yes, search is worth trying. If not, best-of-N with an LLM judge is probably the ceiling.
Code
Open source: mcts-reasoning.
```shell
pip install -e ".[anthropic]"
export ANTHROPIC_API_KEY=your-key
python experiments/run_csp.py --hard --budget 1.00
```
The provider tracks token usage and enforces budget caps.
Limitations
The problems are hand-crafted. A generator calibrated by Pass@1 rate would be more convincing. Ten problems show the pattern but aren't enough for statistical significance.
I tested one model. A weaker model might show MCTS advantage on simpler problems. A stronger one would need harder problems.
The MCTS exploration context shows which solutions scored well and poorly, but not which specific constraints were violated. Adding evaluator feedback to the exploration prompt is an obvious improvement I haven’t tried yet.
Summary
Three conditions for MCTS to help LLM reasoning:
- A deterministic verifier. Not a learned reward model, not an LLM judge.
- Partial credit from the verifier. A gradient, not just pass/fail.
- A problem hard enough that blind sampling can’t reliably solve it.
When all three hold, MCTS outperforms best-of-N, self-consistency, and single-pass. When any one fails, it doesn’t.