Standard RAG retrieves few-shot examples by embedding similarity, which doesn’t learn from outcomes. A trace that looks similar but leads the LLM astray gets retrieved just as readily as one that consistently helps. Closing that loop sounds clean.
Here’s the setup. Store every reasoning trace the LLM produces. When a new problem arrives, retrieve the top-k most similar traces, feed them as few-shot examples, observe the solution, score it. Now do something standard RAG doesn’t: assign each stored trace a learned value V(T) representing its utility as a few-shot example, and weight retrieval by a mix of similarity, learned value, and a UCB exploration bonus.
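As a sketch, the mixed retrieval score might look like the following. All names and defaults here are assumptions for illustration (the post doesn't specify the mixing weight or the exploration constant); `sim` is the embedding similarity, `value` is the learned V(T), and the last term is a standard UCB-style bonus driven by retrieval counts.

```python
import math

def retrieval_score(sim, value, n_uses, n_total, beta=0.5, c=1.0):
    # sim:     embedding similarity between the new problem and the stored trace
    # value:   learned utility V(T) of this trace as a few-shot example
    # n_uses:  how often this trace has been retrieved so far
    # n_total: total retrievals across the whole store
    # beta, c: hypothetical mixing weight and exploration strength, tuned per task
    ucb = c * math.sqrt(math.log(n_total + 1) / (n_uses + 1))
    return sim + beta * value + ucb
```

Top-k retrieval then just sorts stored traces by this score instead of by similarity alone; rarely-used traces get a larger bonus, so they still get tried occasionally even if their current V(T) is low.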
The learning rule is TD(0):
V(T) <- V(T) + alpha * (r + gamma * V(T_new) - V(T))
T is a retrieved trace, r is the reward (did the solution score well?), T_new is the trace just produced. The bootstrap term gamma * V(T_new) is supposed to be the magic. If T_new later turns out to be a useful example in its own right, V(T_new) rises, and on subsequent updates that rise flows back to T. Credit propagates through chains of retrieval influence (A retrieved to help produce B, B retrieved to help produce C, so A gets some credit for C) without explicit graph traversal. It’s textbook TD, applied to a graph where the nodes are stored traces and the edges are retrieval provenance.
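In code, the update is a few lines over the provenance edges. This is a minimal sketch, assuming V(T) lives in a dict keyed by trace id and that every exemplar retrieved for a problem gets the same update (alpha and gamma values are placeholders):

```python
def td_update(values, retrieved_ids, new_id, r, alpha=0.1, gamma=0.9):
    """TD(0) over the retrieval-provenance graph.

    values:        dict mapping trace id -> V(T)
    retrieved_ids: traces used as few-shot examples for this problem
    new_id:        the trace just produced with their help
    r:             reward for the new solution (e.g. 1.0 if scored correct)
    """
    v_new = values.get(new_id, 0.0)
    for tid in retrieved_ids:
        v = values.get(tid, 0.0)
        # Bootstrap: each exemplar is pulled toward r + gamma * V(T_new),
        # so a later rise in V(T_new) flows back on subsequent updates.
        values[tid] = v + alpha * (r + gamma * v_new - v)
    values.setdefault(new_id, 0.0)  # register the new trace in the store
    return values
```

Running it on the A-to-B-to-C chain from above: after B is produced with A's help and scored, V(A) moves toward r; after C is produced with B's help, V(B) rises; the next time the A-to-B edge is updated, the bootstrap term gamma * V(B) carries some of C's success back to A with no graph traversal.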
I built this, evaluated it on GSM8K with Haiku, and spent a few days tuning it.
Here’s the problem. Compare the full method against a trivial baseline: set V(T) = whether trace T solved its own problem correctly, weight retrieval by that, never update it. No TD. No influence graph. No bootstrapping. Just “prefer exemplars whose own answers were right.” That baseline matches the full method.
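The baseline collapses all of the machinery above into one frozen bit. A sketch (the `beta` mixing weight is the same hypothetical knob as in the full method):

```python
def baseline_score(sim, solved_own_problem, beta=0.5):
    # Trivial baseline: V(T) is a correctness bit, set once when the
    # trace is stored and never updated.  No TD, no provenance graph,
    # no exploration bonus.
    return sim + beta * (1.0 if solved_own_problem else 0.0)
```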
The improvement over similarity-only retrieval isn’t coming from learning. It’s coming from not retrieving exemplars with wrong answers. On GSM8K, correctness is binary and “good trace = good exemplar.” A lookup table of correctness captures everything the TD machinery laboriously rediscovers.
The lesson isn’t that TD on an influence graph is wrong in principle. It’s that for the mechanism to matter, good-trace and good-exemplar have to diverge. You need tasks where a trace can be correct but unhelpful as a demonstration, or incorrect but pedagogically useful (common failure modes, near-miss reasoning). GSM8K is not that. It’s a task where the reward signal is exactly what you want to retrieve on, and any quality tracker converges to the same answer.
If I come back to this, I’ll try code generation, where a correct solution can be a misleading template for a subtly different problem, or theorem proving, where the useful examples are often the failed attempts.
For now: run the trivial baseline first.