2/23/2026 at 9:30:25 PM
I don't think this counts as distillation. Distillation is when you use a teacher model to train a student model, but crucially, you have access to the entire probability distribution over the generated tokens, not just to the tokens themselves. That probability distribution tremendously increases the strength of the training signal, so training converges much faster. Claude does not provide these probabilities. So Claude was used for synthetic training data generation, but not really for distillation.

by credit_guy
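(A minimal sketch of the distinction the comment draws, with made-up logits over a hypothetical 4-token vocabulary: a hard-label loss sees only the sampled token, while a soft-label distillation loss uses the teacher's full distribution at every position.)

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Hypothetical logits for one position; both arrays are illustrative, not real model outputs.
teacher_logits = np.array([2.0, 1.0, 0.5, -1.0])
student_logits = np.array([1.5, 0.2, 0.8, -0.5])
p = softmax(teacher_logits)  # full teacher distribution (the distillation signal)
q = softmax(student_logits)  # student distribution

# Hard-label loss: only the single sampled token contributes,
# so each position yields one scalar of supervision.
sampled_token = int(np.argmax(p))
hard_loss = -np.log(q[sampled_token])

# Soft-label (distillation) loss: KL(p || q) touches every entry of p,
# so each position carries vocabulary-sized information.
soft_loss = np.sum(p * (np.log(p) - np.log(q)))
```

The extra signal per position is why training on teacher distributions tends to converge faster than training on sampled tokens alone.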
2/23/2026 at 11:19:05 PM
Sampling repeatedly gives them an estimate of the probability distribution in any case, though.

by hooloovoo_zoo
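(A sketch of that point, assuming a hypothetical fixed next-token distribution we can only sample from: the empirical frequencies converge to the true probabilities, with error shrinking roughly as 1/sqrt(n).)

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "true" next-token distribution, hidden behind a sampling API.
true_p = np.array([0.6, 0.25, 0.1, 0.05])

def sample_token():
    # Stand-in for one API call that returns only a sampled token id.
    return rng.choice(len(true_p), p=true_p)

# Monte Carlo estimate from repeated sampling.
n = 20000
counts = np.bincount([sample_token() for _ in range(n)], minlength=len(true_p))
est_p = counts / n
```

The catch is cost: recovering a usable estimate of a large vocabulary's tail this way takes many samples per position.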
2/23/2026 at 11:24:30 PM
That would be an interesting paper, actually: what is the optimal sampling technique, given that you only have access to the token outputs? Surely someone has already done it.

by hooloovoo_zoo
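(One obvious baseline for that question, as a sketch: add-alpha (Laplace) smoothing over the observed token counts, which assigns nonzero mass to tokens never seen in the samples. The counts below are made up.)

```python
import numpy as np

def smoothed_estimate(counts, alpha=1.0):
    # Add-alpha smoothing: (c_i + alpha) / (N + alpha * V).
    counts = np.asarray(counts, dtype=float)
    return (counts + alpha) / (counts.sum() + alpha * len(counts))

# Token 3 was never sampled, but still receives nonzero probability mass.
est = smoothed_estimate([7, 2, 1, 0])
```

An "optimal" scheme would presumably improve on this by exploiting structure, e.g. sampling at several temperatures or using rank information, but smoothing is the standard starting point.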