2/20/2026 at 3:49:40 PM
Diffusion model papers are always interesting to read, but I always feel like they need some mechanism to insert or delete tokens. In the example in the figure in this post, once it has fixed "British munchkin cats _ _ and ..." you _can't_ get to "British munchkin cats are a new and controversial breed." because there's not the right number of tokens between "cats" and "and". In a coding context, if your model samples a paren or a comma or something that is entirely plausible at that position, it can still close off an expansion that would have been syntactically correct.
by abeppu
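A minimal sketch of that slot-count constraint, assuming a hypothetical one-word-per-token split (not the post's actual tokenizer):

    # Two masked slots were committed between "cats" and "and", so any
    # continuation that needs a different number of tokens there is unreachable.
    template = ["British", "munchkin", "cats", "<mask>", "<mask>", "and", "..."]
    target = ["British", "munchkin", "cats", "are", "a", "new", "and",
              "controversial", "breed", "."]

    slots_available = template.index("and") - template.index("cats") - 1  # 2
    tokens_needed = target.index("and") - target.index("cats") - 1        # 3
    print(slots_available, tokens_needed)  # 2 3 -> the target sentence no longer fits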
2/20/2026 at 8:15:14 PM
OK, but then, in this regard, left to right generation is hardly better: once you get to "British cats <next-token-here>" you can't get to "British munchkin cats <next-token-here>"; the tokens to the left are done and dusted.
It's kind of a feature. Diffusion is used for images, right? It's like saying that once the image of a door has started to form right next to a kitchen counter, the model cannot insert a refrigerator there any more. Well, maybe it doesn't "want to", because that layout is already settled by that time.
by kazinator
2/20/2026 at 5:59:21 PM
This blog post references block diffusion, which fixes the issue you are describing.
by LarsDu88
2/20/2026 at 10:08:16 PM
The cat example is from the section on their block-causal attention mask. I really don't think this fixes the issue. So far as I can see, the block schedule dictates when they sample at each position. It does _not_ change that they basically have an array-of-token-vars representation, and once `t_i` is sampled, nothing can "move" that value left or right.
by abeppu
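A toy sketch of that reading of block diffusion (my illustration, not the paper's actual algorithm): the block schedule only decides when each fixed slot gets filled; no step can add or remove a slot.

    import random

    seq_len, block_size = 12, 4
    tokens = ["<mask>"] * seq_len

    # Denoise block by block; within a block, fill positions in any order.
    for block_start in range(0, seq_len, block_size):
        block = list(range(block_start, block_start + block_size))
        random.shuffle(block)
        for i in block:
            tokens[i] = f"t_{i}"  # slot i gets a value, but never moves
    print(tokens)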
2/20/2026 at 5:57:39 PM
But the "infilling" problem isn't exactly solved for AR LLMs, so it's a strange critique.Further more, you're applying the logic of AR LLMs to diffusion models. AR LLMs are only seeking the probability of the next token (a chain of conditional probability), but diffusion LLMs are modeling the probability of the entire output at once. Because of this token structures that leads to invalid outputs should be extremely low probability if properly trained.
by crystal_revenge
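For reference, a rough way to write the contrast being drawn here (my notation, not from the post):

    \text{AR:}\quad p(x_{1:N}) = \prod_{t=1}^{N} p(x_t \mid x_{<t}) \quad \text{(one token at a time, left to right)}

    \text{Masked diffusion:}\quad p_\theta(x_1, \ldots, x_N) \text{ modeled jointly over all positions, with the length } N \text{ fixed up front}

The joint view is why malformed outputs should get low probability; the fixed N is the constraint the parent comment is pointing at.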
2/20/2026 at 4:58:58 PM
I think that having an early draft of the output is part of the appeal of this type of model.
by moralestapia
2/20/2026 at 5:08:19 PM
Early draft yes. But when you write an early draft of prose or code, you leave yourself the ability to insert or remove material in a way that _changes the indexes of the tokens you already put in your draft_. If you write a letter, you may know that it ends with "Yours Truly, <your name>", but not know the absolute number of tokens the letter will use. In this framework, once you say that "Yours Truly, John Hancock" are tokens 501 to 506, infilling the preceding sentences requires that you exactly preserve the number of tokens before that point ... which to me seems silly. I'm sure it's computationally messy to be able to slide stuff around, but if it meaningfully changes the topology of the search process, it may be worth it.
by abeppu
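A tiny illustration of that fixed-index constraint, with made-up positions matching the example above:

    # If the closing is committed to absolute slots 501..506, every draft of the
    # body must come out to exactly 500 tokens -- inserting or cutting a sentence
    # later would require re-indexing the frozen suffix.
    suffix = {501: "Yours", 502: "Truly", 503: ",", 504: "John", 505: "Hancock", 506: "."}
    body_budget = min(suffix) - 1
    print(body_budget)  # 500, fixed before you know what the body will say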
2/20/2026 at 4:50:15 PM
IIRC, some researchers are working on mixed AR+diffusion models for this sort of thing.
by naasking
2/20/2026 at 5:26:33 PM
I think the gap is that if they're building hybrids with _forward_ AR and diffusion, they risk giving up the cool part of diffusion, which is reasoning backwards. I may be imposing unreasonable human biases onto this, but I really think it would be interesting to have the model engage with the structure of the text, rather than it just being either a sequence or an array of tokens. E.g. "I'm going to _ tomorrow." If the _ is not just a token but an expansion in context, which might be a noun phrase, a verb phrase, etc., it could be filled in with "the mall" or "practice guitar". In code, "if (_1) { return _2; }": _1 could be an expression whose type is bool, and which makes sense as a check to confirm that some process is finished. I don't care specifically how many tokens either of those is, but I do care that it makes sense in context.
by abeppu
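A sketch of what such a typed-hole representation might look like (entirely hypothetical; the names and structure here are made up for illustration):

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Hole:
        name: str
        kind: str  # e.g. "bool_expr", "expr", "noun_phrase"

    # A hole is a constrained slot whose expansion can be any number of tokens,
    # rather than a fixed array index.
    template = ["if", "(", Hole("_1", "bool_expr"), ")", "{",
                "return", Hole("_2", "expr"), ";", "}"]

    def fill(template, bindings):
        out = []
        for item in template:
            out.extend(bindings[item.name] if isinstance(item, Hole) else [item])
        return out

    # Both fills are valid even though they expand to different token counts.
    print(fill(template, {"_1": ["job", ".", "done", "(", ")"], "_2": ["result"]}))
    print(fill(template, {"_1": ["count", ">", "0"], "_2": ["items", "[", "0", "]"]}))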
2/20/2026 at 9:43:23 PM
I was thinking of something like LLaDa, which uses a Transformer to predict forward masked tokens:
by naasking