alt.hn

5/22/2025 at 10:10:09 AM

Strengths and limitations of diffusion language models

https://www.seangoedecke.com/limitations-of-text-diffusion-models/

by rbanffy

5/22/2025 at 3:46:38 PM

I'm curious: in image generation, flow matching is said to be better than diffusion, so why do these language models still start from diffusion instead of jumping straight to flow matching?

by billconan
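
For intuition, here is a minimal sketch of the two training objectives in PyTorch-style code. Text diffusion models usually corrupt discrete tokens by masking rather than by adding Gaussian noise, so the continuous formulation below is only illustrative, and the model signature and variable names are assumptions rather than any particular paper's notation.

    import torch

    def diffusion_loss(model, x0, t, alpha_bar):
        # DDPM-style: corrupt x0 with Gaussian noise at timestep t,
        # then train the model to predict the noise that was added.
        eps = torch.randn_like(x0)
        a = alpha_bar[t].view(-1, 1)                # cumulative noise schedule
        x_t = a.sqrt() * x0 + (1 - a).sqrt() * eps
        return torch.mean((model(x_t, t) - eps) ** 2)

    def flow_matching_loss(model, x0):
        # Rectified-flow style: interpolate on a straight line between
        # noise and data, then train the model to predict the constant
        # velocity of that line.
        x1 = torch.randn_like(x0)                   # pure-noise endpoint
        t = torch.rand(x0.shape[0], 1)              # uniform time in [0, 1]
        x_t = (1 - t) * x1 + t * x0                 # linear interpolant
        return torch.mean((model(x_t, t) - (x0 - x1)) ** 2)

The two losses differ mainly in what the network is asked to predict, which is one reason switching between them is often framed as a matter of training recipe rather than architecture.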

5/22/2025 at 6:14:02 PM

This is just a guess, but I think it's because diffusion training is more popular, so we've worked out more of the kinks with those models. Flow matching models might follow once some of their hyperparameters are figured out.

by gessha

5/22/2025 at 6:17:00 PM

Great overview. I wonder if we'll start to see more text diffusion models from other players, or maybe even a mixture of diffusion and transformer models alternating roles behind a single UI, depending on the context and request.

by accrual

5/22/2025 at 7:24:13 PM

The diffusion models are (or can be) transformer models! They're just not autoregressive.

by shrubhub
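
To make that concrete, here is a minimal sketch of the same transformer backbone used either autoregressively (causal mask) or as a masked-diffusion denoiser (bidirectional attention, all masked positions predicted in parallel). The class, the confidence threshold, and the unmasking rule are illustrative assumptions, not any specific model's design.

    import torch
    import torch.nn as nn

    class Denoiser(nn.Module):
        # The backbone is an ordinary transformer either way; what
        # changes is the attention mask and the decoding loop.
        def __init__(self, vocab, dim=512, layers=6, heads=8):
            super().__init__()
            self.embed = nn.Embedding(vocab, dim)
            block = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
            self.encoder = nn.TransformerEncoder(block, layers)
            self.head = nn.Linear(dim, vocab)

        def forward(self, tokens, causal=False):
            mask = None
            if causal:  # autoregressive use: each token sees only its past
                n = tokens.shape[1]
                mask = nn.Transformer.generate_square_subsequent_mask(n)
            return self.head(self.encoder(self.embed(tokens), mask=mask))

    def denoise_step(model, tokens, mask_id):
        # Diffusion-style use: bidirectional attention, predict every
        # masked position at once, keep only the confident predictions.
        logits = model(tokens, causal=False)
        probs, preds = logits.softmax(-1).max(-1)
        confident = (tokens == mask_id) & (probs > 0.9)  # illustrative cutoff
        return torch.where(confident, preds, tokens)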

5/22/2025 at 2:00:38 PM

That's a nice explanation. I wonder whether autoregressive and diffusion language models could be combined such that the model only denoises the (most recent) end of a sequence of text, like the last paragraph, while the rest stays fixed and allows for key-value caching.

by cubefox
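
A rough sketch of what that hybrid might look like, reusing the denoise_step sketched above: generate block by block, freezing each finished block as fixed context and denoising only the newest one. The block size, step count, and loop structure are assumptions for illustration.

    import torch

    def generate(model, prompt, mask_id, block=64, steps=8, n_blocks=4):
        # Semi-autoregressive decoding: earlier text is frozen and only
        # the newest block of masked tokens is iteratively denoised.
        seq = prompt
        for _ in range(n_blocks):
            fresh = torch.full((seq.shape[0], block), mask_id)
            seq = torch.cat([seq, fresh], dim=1)
            for _ in range(steps):
                # denoise_step only rewrites positions still equal to
                # mask_id, so the prompt and frozen blocks pass through.
                seq = denoise_step(model, seq, mask_id)
        return seq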

5/22/2025 at 6:20:04 PM

Hi, I wrote the post. Thank you!

That's how it works, but unfortunately denoising the last paragraph requires computing attention scores for every token in that paragraph at every denoising step, which means checking those tokens against every token in the sequence. So it's still much less cacheable than the equivalent autoregressive model.

by gfysfm
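
To make the caching point concrete, here is a sketch of one attention computation under that scheme (single head, projections omitted; the names are illustrative). The prefix keys and values can be cached once, but the block's queries, keys, and values change on every denoising step, so this whole computation repeats once per step:

    import torch

    def block_attention(q_block, k_block, v_block, k_cache, v_cache):
        # k_cache/v_cache: frozen-prefix keys and values, computed once.
        # *_block: projections for the block being denoised; these are
        # recomputed at every denoising step.
        k = torch.cat([k_cache, k_block], dim=1)    # (B, prefix+block, d)
        v = torch.cat([v_cache, v_block], dim=1)
        scores = q_block @ k.transpose(1, 2) / k.shape[-1] ** 0.5
        return scores.softmax(-1) @ v               # (B, block, d)

With prefix length P, block length B, and S denoising steps, the block costs roughly S * B * (P + B) attention scores, where an autoregressive decoder would pay for each of those B tokens only once.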