alt.hn

3/15/2026 at 12:51:55 AM

Tree Search Distillation for Language Models Using PPO

https://ayushtambde.com/blog/tree-search-distillation-for-language-models-using-ppo/

by at2005

3/15/2026 at 4:20:45 AM

> One might note that MCTS uses more inference compute on a per-sample basis than GRPO: of course it performs better

This part confused me. It sounded like they were only doing MCTS at train time and then using GRPO to distill the MCTS policy into the model weights. So wouldn’t the model still have the same inference cost?

by supermdguy

3/15/2026 at 6:50:01 AM

Ah, I meant that MCTS uses more inference-time compute than GRPO to produce each training sample.

by at2005

3/15/2026 at 2:03:35 PM

I may never understand what harness means - it's used in so many contexts

by qumpis

3/15/2026 at 5:28:58 PM

It's a thing that isn't part of the "subject" but is used with the subject to manipulate the state of "the subject" to be closer to what we want.

by blamestross

3/15/2026 at 6:20:11 AM

Great post! I wonder why MCTS is not more popular as a test-time compute harness. Did you compare the performance of MCTS (without distillation) against other methods (e.g. best-of-N) with the same compute budget?
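For readers unfamiliar with the baseline mentioned here, a minimal sketch of best-of-N under a fixed budget might look like the following. Everything in it is an assumption for illustration: `generate`, `score`, `token_budget`, and `tokens_per_sample` are hypothetical names, and the budget is approximated as a flat per-sample token cost rather than anything measured in the post.

```python
from typing import Callable, Tuple


def best_of_n(
    generate: Callable[[str], str],   # hypothetical sampler: prompt -> completion
    score: Callable[[str], float],    # hypothetical verifier or reward model
    prompt: str,
    token_budget: int,                # total tokens we allow ourselves to spend
    tokens_per_sample: int,           # assumed flat cost of one completion
) -> Tuple[str, int]:
    """Sample as many completions as the budget allows and keep the best one."""
    # The number of samples is derived from the budget, so a tree search given
    # the same token_budget can be compared on equal footing.
    n = max(1, token_budget // tokens_per_sample)
    candidates = [generate(prompt) for _ in range(n)]
    best = max(candidates, key=score)
    return best, n
```

Fixing `token_budget` rather than N is the point of the comparison: a search method that expands more nodes per answer has to pay for them out of the same budget.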

by natufunu

3/15/2026 at 11:57:26 AM

Why is almost every RL paper done on Qwen-2.5? That decreases its credibility.

by richardvsu

3/15/2026 at 2:12:35 PM

It makes it easier to compare with other papers. If two different papers apply different methods to different models and get different results, how do you know which method is better?

Once you have identified the best method and want to productize it, it would of course make sense to apply it on top of the best model, but if you're just doing research, you can skip that expensive last step.

by yorwba

3/15/2026 at 3:18:18 PM

> Why is almost every RL paper done on Qwen-2.5?

In what way does using this model reduce the authors' credibility?

by mapontosevenths

3/15/2026 at 9:39:26 AM

great write-up (and effort!! ;))

what are your thoughts on MCTS for coding?

this can/must be paired with a smart execution harness to optimise roll out and roll back of execution paths and system state.

does this change the calculus for optimal post-training?
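as a rough illustration of the roll out / roll back idea, here is a minimal sketch of a harness that snapshots state before each action so a search can back out of a branch. the class name `SnapshotHarness` and its methods are hypothetical, and it assumes the whole system state fits in a deep-copyable Python object, which a real coding environment (files, processes, containers) would not.

```python
import copy
from typing import Any, Callable, List


class SnapshotHarness:
    """Hypothetical harness: apply actions to a state, undo them on demand."""

    def __init__(self, state: Any) -> None:
        self._state = state
        self._snapshots: List[Any] = []  # stack of states saved before each action

    @property
    def state(self) -> Any:
        return self._state

    def apply(self, action: Callable[[Any], None]) -> None:
        # Save a full copy first so the action can be undone later.
        self._snapshots.append(copy.deepcopy(self._state))
        action(self._state)

    def rollback(self) -> None:
        # Restore the state from just before the most recent action.
        if not self._snapshots:
            raise RuntimeError("nothing to roll back")
        self._state = self._snapshots.pop()
```

a tree search would call `apply` when descending into a child node and `rollback` when backing up to try a sibling. deep copies are the simplest possible choice here, not an efficient one.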

by algo_trader
