3/15/2026 at 4:20:45 AM
> One might note that MCTS uses more inference compute on a per-sample basis than GRPO: of course it performs betterThis part confused me, it sounded like they were only doing the MCTS at train time, and then using GRPO to distill the MCTS policy into the model weights. So wouldn’t the model still have the same inference cost?
by supermdguy
3/15/2026 at 6:50:01 AM
Ah, I meant that MCTS uses more inference-time compute (over GRPO) to produce a training sampleby at2005