6/12/2026 at 8:41:58 AM
Don't work at a lab but I think they might be warping the probability distribution in the decoding step, at least to generate RL examples for training and maybe in production too.
There aren't other comments discussing this possibility at the moment, but you don't have to take the token predicted as most likely (greedy decoding). Most decoding strategies do something else, which is where settings like temperature come in. So if you want the model to "think harder" you can track whether the current tokens are thinking or answer (in OpenAI's system that's called a channel), and then if you're in a thinking block you might get a model output whose top three predictions are:
60% <|channel=answer|>
10% Wait,
5% . The
[...]
Greedy decoding would stop thinking at this point and start answering, but you want the model to keep thinking, so you skip that token and select the next most likely one, which is "Wait, ". The reasoning levels can then map to the probability of skipping the channel-change tokens.
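To make that concrete, here's a rough sketch of the skipping idea in Python. This is a guess at the mechanism, not anything a lab has confirmed: the function name pick_token, the SKIP_PROB table and its values are all made up for illustration, and the probabilities are just the three from the example above.

```python
import random

CHANNEL_SWITCH = "<|channel=answer|>"

# Hypothetical mapping from reasoning level to the chance of
# skipping the channel-change token. The numbers are invented.
SKIP_PROB = {"low": 0.0, "medium": 0.5, "high": 0.9}


def pick_token(probs, skip_prob, rng):
    """Greedy pick, except the channel-change token is passed over
    with probability skip_prob in favour of the runner-up."""
    ranked = sorted(probs, key=probs.get, reverse=True)
    best = ranked[0]
    if best == CHANNEL_SWITCH and len(ranked) > 1 and rng.random() < skip_prob:
        return ranked[1]  # keep thinking: take the next most likely token
    return best


# Top predictions from the example above (the rest of the tail is omitted).
probs = {"<|channel=answer|>": 0.60, "Wait,": 0.10, ". The": 0.05}
rng = random.Random(0)
print(pick_token(probs, SKIP_PROB["low"], rng))   # plain greedy behaviour
print(pick_token(probs, SKIP_PROB["high"], rng))  # usually keeps thinking
```

With a skip probability of 0 this is plain greedy decoding, and with 1 the model can never leave the thinking block on its own, so a real system would presumably also need some cap on thinking length.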
by mike_hearn
6/12/2026 at 11:45:35 AM
I also thought of this as a general idea: intelligence at the sampling level. Broadly you have different tiers of intelligence:
1. take the highest probability token
2. use some lightweight code that tracks some state, like the number of tokens or some sampling distribution
3. at a higher level, use a smaller LLM to decide which token to sample (just a thought)
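A minimal sketch of what tier 2 could look like, assuming the only state tracked is a token count. The class name BudgetedSampler, the min_thinking_tokens parameter and the reuse of the channel token from the parent comment are all assumptions for illustration.

```python
class BudgetedSampler:
    """Tier 2 sketch: greedy decoding plus a token counter.

    Hypothetical: refuses to emit the stop token until a minimum
    number of tokens has been produced.
    """

    def __init__(self, min_thinking_tokens, stop_token="<|channel=answer|>"):
        self.min_thinking_tokens = min_thinking_tokens
        self.stop_token = stop_token
        self.seen = 0  # the only state tracked: tokens emitted so far

    def pick(self, probs):
        ranked = sorted(probs, key=probs.get, reverse=True)
        choice = ranked[0]
        under_budget = self.seen < self.min_thinking_tokens
        if choice == self.stop_token and under_budget and len(ranked) > 1:
            choice = ranked[1]  # too early to stop, take the runner-up
        self.seen += 1
        return choice


sampler = BudgetedSampler(min_thinking_tokens=2)
probs = {"<|channel=answer|>": 0.60, "Wait,": 0.10, ". The": 0.05}
print([sampler.pick(probs) for _ in range(3)])
```

Tier 1 is the same thing with min_thinking_tokens=0, and tier 3 would swap the counter check for a call out to a smaller model.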
by simianwords