3/21/2026 at 6:22:28 AM
I'm looking forward to comparing this to Inception 2 (the text diffusion model) which in my experience is very fast and reasonably high quality.
by nl
3/21/2026 at 9:15:32 AM
You mean Mercury 2, by Inception: https://openrouter.ai/inception/mercury-2
by PhilippGille
3/21/2026 at 8:34:53 AM
That's completely different. That's like saying you want to compare the Nvidia 5090 GPU to the latest Call of Duty.
by jychang
3/21/2026 at 1:40:00 PM
You are right, people who downvoted you are just ignorant.
by cubefox
3/21/2026 at 7:38:27 AM
Mamba-3 is an architecture while diffusion is, I believe, a type of objective. So these are not mutually exclusive and therefore not comparable.
by cubefox
3/21/2026 at 9:50:15 AM
Not wrong, but I think it's more accurate to say: Mamba is an architecture for the middle layers of the network (the trunk) which assumes decoding takes place through an autoregressive sequence (popping out tokens in order). This is the SSM they talk about.
Diffusion is an alternative to the autoregressive approach where decoding takes place through iterative refinement on a batch of tokens (instead of processing tokens one at a time and locking each one in, looking only forward). This can require different architectures for the trunk and the output heads, and modifications to the objective to make the whole thing trainable. Could Mamba-like ideas be useful in diffusion networks? Maybe, but it's a different problem setup.
by gyrovagueGeist
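The two decoding regimes described above can be sketched with a toy stub (`toy_model` below is a hypothetical stand-in for a real network, not any actual API; the unmasking schedule is likewise a simplified illustration, not a specific diffusion sampler):

```python
import random

random.seed(0)
VOCAB = ["the", "cat", "sat", "on", "mat"]
MASK = "<mask>"

def toy_model(context):
    # Stand-in for a real network: ignores context, returns a random token.
    return random.choice(VOCAB)

def autoregressive_decode(length):
    # One token per step; each step conditions only on the prefix,
    # and every emitted token is locked in permanently.
    out = []
    for _ in range(length):
        out.append(toy_model(out))
    return out

def diffusion_decode(length, steps=3):
    # Start from an all-masked sequence and iteratively refine:
    # each step fills in a batch of positions, conditioning on the
    # full (partially filled) sequence rather than just a prefix.
    seq = [MASK] * length
    per_step = max(1, length // steps)
    for step in range(steps):
        masked = [i for i, t in enumerate(seq) if t == MASK]
        # Fill a fraction each step; the last step clears all remaining masks.
        to_fill = masked if step == steps - 1 else masked[:per_step]
        for i in to_fill:
            seq[i] = toy_model(seq)
    return seq
```

The structural point is visible in the loop shapes: the autoregressive loop only ever appends, while the diffusion loop revisits a whole sequence of slots over several passes, which is why the training objective and output setup differ even when the trunk architecture could be shared.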
3/21/2026 at 2:26:13 PM
Mamba doesn't assume auto-regressive decoding, and you can absolutely use it for diffusion, or pretty much any other common objective. Same with a conventional transformer. For a discrete diffusion language model, the output head is essentially the same as an autoregressive one. But yes, the training/objective/inference setup is different.
by joefourier
3/21/2026 at 1:41:48 PM
Linear architectures are at least heavily used in image diffusion models. More so, in fact, than in language models.
by cubefox
3/21/2026 at 1:30:14 PM
I mean, I guess, but the diffusion objective and the ability to do simultaneous decode both dictate pretty different architectures in practice.
by nl
3/21/2026 at 1:39:04 PM
Apparently not. See https://arxiv.org/abs/2511.15927v3
by cubefox