5/22/2025 at 2:00:38 PM
That's a nice explanation. I wonder whether autoregressive and diffusion language models could be combined such that the model only denoises the (most recent) end of a sequence of text, like a paragraph, while the rest is unchangeable and allows for key-value caching.by cubefox
5/22/2025 at 6:20:04 PM
Hi, I wrote the post. Thank you!That’s how it does work, but unfortunately denoising the last paragraph requires computing attention scores for every token in that paragraph, which requires checking those tokens against every token in the sequence. So it’s still much less cacheable than the equivalent autoregressive model.
by gfysfm