An Explorable Explanation
Watch an LLM Think
How Monte Carlo Tree Search turns single-pass reasoning into a branching exploration of ideas
One Shot, One Chance
Here is a classic logic puzzle:
A says: "B is a knave."
B says: "We are the same type."
Is A a knight or a knave?
(Knights always tell the truth. Knaves always lie.)
Let's ask an LLM to solve it:
The LLM assumed A is a knave and never tested the alternative. It reached a confident answer without checking all the constraints.
What if the LLM could explore both assumptions, and backtrack from the one that leads to a contradiction?
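The systematic check the LLM skipped is small enough to enumerate directly. A minimal sketch (pure illustration, not part of any LLM pipeline) tests every assignment of knight/knave against both statements:

```python
from itertools import product

def consistent(a_is_knight: bool, b_is_knight: bool) -> bool:
    """True if both statements match the speakers' types."""
    # A says: "B is a knave." True iff B is not a knight.
    a_statement = not b_is_knight
    # B says: "We are the same type." True iff A and B match.
    b_statement = a_is_knight == b_is_knight
    # Knights' statements must be true; knaves' must be false.
    return a_statement == a_is_knight and b_statement == b_is_knight

solutions = [(a, b) for a, b in product([True, False], repeat=2) if consistent(a, b)]
print(solutions)  # [(True, False)] -- A is a knight, B is a knave
```

Only one world survives: A is a knight. The LLM's untested assumption was exactly the one that fails.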
The Tree Appears
Instead of committing to a single reasoning path, MCTS keeps many paths alive at once, repeatedly expanding whichever one currently looks most worth exploring. It builds a tree where each branch represents a different line of reasoning.
But how does it decide which branch to explore next? It uses a formula called UCB1:

UCB1 = value/visits + c · √(ln(parent visits) / visits)

The first term (value) measures how promising a path has been so far. The second term (exploration) favors paths that haven't been tried much.
The constant c controls the balance. Try adjusting it:
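In code, UCB1 is a few lines. A sketch (the default c = 1.4 is a common choice, not a fixed rule):

```python
import math

def ucb1(total_value: float, visits: int, parent_visits: int, c: float = 1.4) -> float:
    """UCB1 score: average value plus an exploration bonus."""
    if visits == 0:
        return float("inf")  # unvisited nodes get tried first
    exploitation = total_value / visits  # how promising so far
    exploration = c * math.sqrt(math.log(parent_visits) / visits)  # how under-explored
    return exploitation + exploration

# A strong but heavily visited node vs. a weak, barely tried one:
print(ucb1(8.0, 10, 20))  # high average, small bonus
print(ucb1(0.5, 1, 20))   # low average, big bonus -- wins at c = 1.4
```

Raising c makes the bonus dominate (more exploration); lowering it makes the average dominate (more exploitation).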
Branching Thoughts
Once UCB1 selects a node, the LLM generates a new reasoning step from that point. The tree grows.
Each branch represents a different assumption the LLM is testing. This is structured exploration, not random guessing.
Click any node in the tree to see its full reasoning.
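One way to represent the growing tree is a node that stores a reasoning step plus the statistics UCB1 needs. A hypothetical sketch (field names are illustrative; the real node contents come from the LLM):

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    thought: str                  # one reasoning step produced by the LLM
    parent: "Node | None" = None
    children: list["Node"] = field(default_factory=list)
    visits: int = 0
    total_value: float = 0.0

    def expand(self, thought: str) -> "Node":
        """Attach a new reasoning step as a child of this node."""
        child = Node(thought=thought, parent=self)
        self.children.append(child)
        return child

root = Node("Is A a knight or a knave?")
branch_a = root.expand("Assume A is a knight.")
branch_b = root.expand("Assume A is a knave.")
print(len(root.children))  # 2 -- two competing assumptions
```

Each call to `expand` is one tree-growth step: UCB1 picks the node, the LLM supplies the `thought`.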
Following the Thread
From the expanded node, the LLM keeps reasoning until it reaches an answer or hits a maximum depth. This is called a rollout.
Watch the nodes appear one by one as the LLM follows its reasoning to a conclusion.
How is this different from regular MCTS?
In classical MCTS, rollout nodes are simulated but discarded. Only the evaluation score is kept. Here, every reasoning step is preserved in the tree.
This means we keep a complete record of every reasoning trace the LLM explored.
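A rollout that preserves every step can be sketched as a loop that appends each generated step to the path instead of throwing it away. The scripted `generate_step` stand-in below replaces a real LLM call and is purely illustrative:

```python
def rollout(start_thought, generate_step, max_depth=6):
    """Follow reasoning until an answer appears or depth runs out.
    Unlike classical MCTS, every simulated step is kept."""
    path = [start_thought]
    while len(path) < max_depth:
        step = generate_step(path)
        path.append(step)  # preserved in the tree, not discarded
        if step.startswith("ANSWER:"):
            break
    return path

# Scripted stand-in for the LLM (illustrative only):
script = iter([
    "Then B would be a knight.",
    "But B claims they are the same type -- a contradiction.",
    "ANSWER: A is a knight.",
])
path = rollout("Assume A is a knave.", lambda p: next(script))
print(path[-1])  # ANSWER: A is a knight.
```

The returned `path` is the full reasoning trace; in classical MCTS only its final score would survive.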
What's a Good Answer?
The rollout reaches an answer, but how do we score it? The evaluator checks whether the reasoning is logically consistent with the puzzle's constraints.
A correct derivation with no contradictions scores 1.0. A derivation that hits a contradiction scores 0.0. Weak reasoning that reaches the right answer by luck scores somewhere in between.
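A toy rule-based evaluator makes the scoring scheme concrete (real systems often use an LLM judge or a formal checker; this stand-in just pattern-matches the trace):

```python
def score(trace: list[str]) -> float:
    """Toy evaluator: 1.0 for a clean derivation, 0.0 for a
    contradiction, 0.5 for an inconclusive path."""
    hit_contradiction = any("contradiction" in step.lower() for step in trace)
    reached_answer = any(step.startswith("ANSWER:") for step in trace)
    if hit_contradiction:
        return 0.0  # the path broke a constraint
    if reached_answer:
        return 1.0  # consistent derivation to an answer
    return 0.5      # ran out of depth without concluding

print(score(["Assume A is a knight.", "Then B is a knave.", "ANSWER: A is a knight."]))  # 1.0
```

The exact rubric matters less than the gradient: paths that violate constraints must score below paths that respect them.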
Scores Flow Upward
When a reasoning path reaches a conclusion and gets scored, the result propagates back up the tree. Each ancestor node updates its average value and visit count.
Good answers make nearby paths more attractive for future exploration. The tree learns where to search next.
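Backpropagation is a short walk up the parent chain. A minimal sketch using plain dicts for nodes (illustrative; any node representation works):

```python
def backpropagate(node, value):
    """Add a rollout's score to every ancestor's statistics."""
    while node is not None:
        node["visits"] += 1
        node["total_value"] += value
        node = node["parent"]

# Tiny two-level tree:
root = {"visits": 0, "total_value": 0.0, "parent": None}
leaf = {"visits": 0, "total_value": 0.0, "parent": root}
backpropagate(leaf, 1.0)  # one good rollout
backpropagate(leaf, 0.0)  # one bad rollout
print(root["total_value"] / root["visits"])  # 0.5 -- the running average
```

These updated averages and visit counts are exactly what UCB1 reads on the next selection step, closing the loop.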
Many Paths, One Answer
Which paths should we look at?
After 20 simulations, the tree has explored many reasoning paths. But which ones matter? Different sampling strategies highlight different aspects:
Showing paths with the highest average values.
How do we pick the final answer?
With multiple paths reaching conclusions, we can use voting to decide:
When multiple independent reasoning paths agree, we can be more confident in the answer.
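Majority voting over the concluded paths is a one-liner with a counter. A minimal sketch (the answer strings are illustrative):

```python
from collections import Counter

def majority_vote(answers: list[str]) -> tuple[str, float]:
    """Return the most common conclusion and the fraction of paths that agree."""
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / len(answers)

paths = ["A is a knight", "A is a knight", "A is a knave", "A is a knight"]
print(majority_vote(paths))  # ('A is a knight', 0.75)
```

The agreement fraction doubles as a rough confidence signal: a 3-to-1 split is more trustworthy than a 2-to-2 tie.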
Try It Yourself
This section requires a local Ollama instance to run live MCTS searches.
Connect Ollama at localhost:11434 to try this.
Discussion