alt.hn

3/17/2026 at 10:45:12 PM

Mamba-3

https://www.together.ai/blog/mamba-3

by matt_d

3/21/2026 at 6:22:28 AM

I'm looking forward to comparing this to Inception 2 (the text diffusion model) which in my experience is very fast and reasonably high quality.

by nl

3/21/2026 at 8:34:53 AM

That's completely different. That's like saying you want to compare the Nvidia 5090 GPU to the latest Call of Duty.

by jychang

3/21/2026 at 1:40:00 PM

You are right, people who downvoted you are just ignorant.

by cubefox

3/21/2026 at 7:38:27 AM

Mamba-3 is an architecture while diffusion is, I believe, a type of objective. So these are not mutually exclusive and therefore not comparable.

by cubefox

3/21/2026 at 9:50:15 AM

Not wrong, but I think it's more accurate to say:

Mamba is an architecture for the middle layers of the network (the trunk) which assumes decoding takes place through an autoregressive sequence (popping out tokens in order). This is the SSM they talk about.

Diffusion is an alternative to the autoregressive approach where decoding takes place through iterative refinement on a batch of tokens (instead of processing one at a time, locking each token in, and only looking forward). This can require different architectures for the trunk and the output heads, plus modifications to the objective to make the whole thing trainable. Could Mamba-like ideas be useful in diffusion networks? Maybe, but it's a different problem setup.
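To make the contrast concrete, here is a toy sketch of the two decoding regimes; `fake_model` and the vocabulary are invented stand-ins for a trained network, purely for illustration:

```python
VOCAB = ["the", "cat", "sat", "on", "mat"]

def fake_model(context):
    # Stand-in for a trained network: deterministic, depends only on context.
    return VOCAB[sum(len(t) for t in context) % len(VOCAB)]

def autoregressive_decode(n_tokens):
    """Emit tokens left to right; each step conditions only on the prefix."""
    seq = []
    for _ in range(n_tokens):
        seq.append(fake_model(seq))
    return seq

def diffusion_style_decode(n_tokens, n_steps=3):
    """Start from all-masked positions and refine the whole batch each step;
    every position conditions on every other position, not just the prefix."""
    seq = ["<mask>"] * n_tokens
    for _ in range(n_steps):
        seq = [fake_model(seq[:i] + seq[i + 1:]) for i in range(n_tokens)]
    return seq

print(autoregressive_decode(4))    # one forward pass per token
print(diffusion_style_decode(4))   # n_steps passes over the whole batch
```

The structural point is in the loops: the autoregressive path makes one call per emitted token and never revisits earlier positions, while the diffusion-style path repeatedly updates all positions at once.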

by gyrovagueGeist

3/21/2026 at 2:26:13 PM

Mamba doesn't assume auto-regressive decoding, and you can absolutely use it for diffusion, or pretty much any other common objective. Same with a conventional transformer. For a discrete diffusion language model, the output head is essentially the same as an autoregressive one. But yes, the training/objective/inference setup is different.

by joefourier

3/21/2026 at 1:41:48 PM

Linear architectures are at least heavily used in image diffusion models. More so in fact than in language models.

by cubefox

3/21/2026 at 1:30:14 PM

I mean I guess but the diffusion objective and the ability to do simultaneous decode both dictate pretty different architectures in practice.

by nl

3/21/2026 at 9:44:37 AM

I'm not sure that I buy their conclusion that more compute during inference is good.

Yes, batch=1 inference is mostly memory bandwidth bound, not GPU compute bound. But no provider does batch=1 inference. Everyone groups all the requests into a batch, and the GPU computes them together.

With a fused kernel, that means the GPU streams the tensors from VRAM, and does a bunch of compute on different conversations in the batch, at the same time.

If they increase the amount of compute required per token, that just reduces the maximum batch size a GPU can handle. In practice, yes, this does mean each GPU can serve fewer users. Providers aren't normally leaving GPU cores idle during inference.
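A rough roofline sketch of this tradeoff, with all hardware and model numbers assumed purely for illustration: the memory-bound token rate grows with batch size until compute becomes the binding constraint, and raising the FLOPs per token lowers that ceiling:

```python
# Illustrative roofline model (all numbers are assumptions, not measurements).
MEM_BW_GBPS = 3000      # assumed HBM bandwidth, GB/s
COMPUTE_TFLOPS = 1000   # assumed dense bf16 throughput
PARAMS_B = 70           # assumed model size, billions of parameters
BYTES_PER_PARAM = 2     # bf16 weights

def tokens_per_sec(batch, flops_per_token):
    # One decode step streams all weights from VRAM once, shared by the
    # whole batch, and runs `batch` tokens' worth of compute.
    step_time_mem = PARAMS_B * 1e9 * BYTES_PER_PARAM / (MEM_BW_GBPS * 1e9)
    step_time_compute = batch * flops_per_token / (COMPUTE_TFLOPS * 1e12)
    return batch / max(step_time_mem, step_time_compute)

flops = 2 * PARAMS_B * 1e9  # ~2 FLOPs per parameter per token
for b in (1, 32, 256, 2048):
    print(f"batch={b:5d}: {tokens_per_sec(b, flops):,.0f} tok/s")
```

With these made-up numbers the crossover from memory-bound to compute-bound lands around batch ~333; past that point, doubling the compute per token directly halves throughput, which is the objection above.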

by jychang

3/21/2026 at 9:56:36 AM

> Everyone groups all the requests into a batch, and the GPU computes them together.

You're only saving on fetching read-only parameters, and not even on that if you're using MoE models where each inference in the batch might require a different expert (unless you rearrange batches so that sharing experts becomes more likely, but that's difficult since experts change per-token or even per-layer). Everything else - KV-cache, activations - gets multiplied by your batch size. You scale both compute and memory pressure by largely the same amount. Yes, GPUs are great at hiding memory fetch latency, but that applies also to n=1 inference.
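A toy estimate of the expert-sharing effect (the uniform router here is an assumption for illustration, not DeepSeek's actual routing): with E routed experts and top-k selection per token, the chance that a given expert is touched by at least one token in a batch of size B rises quickly, so at large batch sizes essentially every expert's weights get streamed anyway:

```python
# Probability that a given expert is activated by at least one token in a
# batch, assuming each token independently picks k of E experts uniformly.
E, k = 256, 8  # illustrative MoE shape

def expert_active_prob(batch):
    return 1 - (1 - k / E) ** batch

for b in (1, 32, 256):
    print(f"batch={b:3d}: expert active with prob {expert_active_prob(b):.3f}")
```

At batch 1 a given expert fires ~3% of the time; by batch 256 it is all but guaranteed, so the "read-only parameters are shared" saving mostly comes back once the batch is large enough.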

by zozbot234

3/21/2026 at 12:13:29 PM

Well, the actual inference providers put each expert on its own single GPU. Deepseek explicitly does this.

Read-only parameters also usually take up the majority of space. Deepseek is 700GB of params, while the KV cache is small (about 7GB at max context) and the SSM/conv1d cache is even smaller (IIRC Qwen 3.5 uses 146MB per sequence regardless of context size). Not sure how Mamba-3 works, but I suspect read-only parameters are still a significant share of memory bandwidth.
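A back-of-envelope using the figures from the comment (the GPU pool size is an assumption added for illustration):

```python
# Figures from the comment, plus an assumed multi-GPU memory pool.
PARAM_GB = 700        # DeepSeek read-only weights
KV_GB_PER_USER = 7    # KV cache at max context, per request
POOL_GB = 8 * 141     # assumed node: 8x 141 GB GPUs (H200-class)

# How many max-context users fit after the weights are resident?
users = int((POOL_GB - PARAM_GB) / KV_GB_PER_USER)

# Of all bytes touched per decode step, what share is read-only weights?
weight_share = PARAM_GB / (PARAM_GB + users * KV_GB_PER_USER)

print(f"concurrent max-context users: {users}")
print(f"weights as share of bytes streamed: {weight_share:.0%}")
```

Even with the node packed to its VRAM limit with max-context requests, the read-only weights still dominate the bytes streamed each step under these assumed numbers, which supports the point about parameter bandwidth.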

I guess the question isn't whether compute is 1:1 with memory, but rather if you run out of compute before you run out of vram adding more users.

by jychang

3/21/2026 at 2:30:51 PM

> Well, the actual inference providers put each expert on its own single GPU.

Experts are usually chosen on a per-layer basis, not just per token, so I'd think this requires having lots of GPUs to make it worthwhile. You could do it with a single physical GPU by switching expert-layer mixes in round-robin fashion after the batch for any single expert-layer mix is completed (essentially a refined version of expert offloading). But still, not easy.

by zozbot234

3/22/2026 at 6:19:39 AM

Correct, but note that's exactly what inference providers do.

https://arxiv.org/pdf/2412.19437

> The minimum deployment unit of the decoding stage consists of 40 nodes with 320 GPUs. The attention part employs TP4 with SP, combined with DP80, while the MoE part uses EP320.

EP320 means expert parallelism across 320 GPUs.

by jychang

3/21/2026 at 12:51:07 PM

Throughput is indeed king for the standard-tier mindshare-capture play. But there are many who would pay multiple times the current cost for agentic systems for engineers and executives, if it meant a meaningful reduction in latency. The economics could work extremely well.

by btown

3/21/2026 at 1:22:02 PM

Why would execs need latency?

I can see it for engineering - coding with slow ai is painful

by Havoc

3/21/2026 at 1:37:55 PM

Impatient execs can also be painful. EDIT: Writing this while I am waiting for Codex to complete, so I may enjoy slow AI more than the usual developer ;-)

by metanonsense

3/21/2026 at 7:02:02 PM

The economic effect of latency is measured not by the incremental productivity itself, but by the combined economic downforce of thousands of resulting HN and Reddit comments :)

by btown

3/21/2026 at 2:26:40 PM

Focusing on needs of providers isn't a very good long term strategy if you believe compute will eventually move to self hosted and on premises solutions where large batch sizes aren't needed.

by notnullorvoid

3/22/2026 at 6:25:54 AM

That's a foolish take.

That's like gamers thinking most of Nvidia's revenue comes from gaming GPUs, so Nvidia should prioritize gamers.

Inference is ruled by inference providers, not local. Local inference is a rounding error, and will remain as such unless there is economic incentive otherwise.

by jychang

3/21/2026 at 10:06:20 AM

Their latency measurements comparing Mamba-2 and Mamba-3 are done with a batch size of 128. It doesn't seem like Mamba-2 was compute-bound even at that batch size.

by yorwba

3/21/2026 at 11:49:42 AM

Well, Deepseek batch sizes are something like 8192, so 128 isn't much.

https://arxiv.org/html/2412.19437v1 "the batch size per expert is relatively small (usually within 256 tokens)"

by jychang

3/21/2026 at 2:51:57 PM

Local has a batch size of 1. If you are already memory bound then you leave compute on the table. Why not use it?

Not sure they target local though…

by sroussey

3/21/2026 at 1:27:51 PM

Is there a reason we don’t switch halfway through? ie start with a classic LLM and switch to something linear like mamba as context grows

by Havoc

3/21/2026 at 2:01:31 PM

Because something linear like Mamba doesn't perform as well; you'd have a performance cliff where the model suddenly gets dumber and forgets a lot of what was going on.

Instead, you can get benefits from both by doing both in parallel. This lets you reduce the size of the O(n^2) attention mechanism, so while it's still quadratic, the constant is reduced quite a bit while retaining most of the performance: the linear context mechanism handles the tasks it's well suited for while attention plays to its strengths.

The recent Nemotron 3 Nano and Super models from NVIDIA are hybrid architectures this way, with most of their context layers as Mamba while retaining enough attention to continue to be competitive on the more complex tasks that require the quadratic attention.

See https://magazine.sebastianraschka.com/i/168650848/18-nemotro... for some discussion on this architecture
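A minimal sketch of the hybrid idea, assuming a made-up 48-layer stack with one attention layer per eight (the ratio and the per-layer cost model are illustrative, not Nemotron's actual configuration):

```python
# Hypothetical hybrid stack: mostly linear-time SSM layers with a few
# attention layers interleaved.
N_LAYERS = 48
ATTN_EVERY = 8  # one attention layer per 8 layers; the rest are SSM

stack = ["attention" if i % ATTN_EVERY == ATTN_EVERY - 1 else "ssm"
         for i in range(N_LAYERS)]

def decode_step_cost(context_len):
    # Toy per-token decode cost: SSM layers are O(1) in context length,
    # attention layers are O(n) (they scan the whole KV cache).
    return sum(context_len if layer == "attention" else 1 for layer in stack)

for n in (1_000, 100_000):
    full_attn = N_LAYERS * n  # cost if every layer were attention
    print(f"ctx={n}: hybrid={decode_step_cost(n)}, full attention={full_attn}")
```

Under this cost model the hybrid is roughly 8x cheaper per decoded token at long context, while the sparse attention layers preserve the global-retrieval ability the SSM layers lack; the per-step cost still grows linearly in context because of those attention layers, just with a much smaller constant.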

by lambda

3/22/2026 at 3:20:10 PM

I am curious about the tradeoffs of hybrid approaches; it sounds too good to be true.

by 3abiton

3/21/2026 at 5:03:36 PM

They did do that, two years ago. The problems are that 1) Mamba's accuracy gets worse as context size grows, 2) Nvidia GPUs are designed for transformers, and 3) all the software out there is also designed for transformers. It's still useful in some applications, but it doesn't beat regular transformers if you have the gear.

by 0xbadcafebee

3/21/2026 at 1:47:19 PM

Probably best achieved by model routing, either an indirection behind the chat UI or an API user does it themselves by calling a different API for long context queries.

by energy123

3/21/2026 at 2:01:24 PM

We kinda do do this with hybrid mamba transformers

by mountainriver

3/21/2026 at 1:52:41 PM

Linear time complexity models are bad at in-context retrieval, which limits their performance on various tasks, so a pure linear model isn't currently feasible anyway, at least for language models. Instead they recommend mixing linear and attention layers. Presumably this mostly solves the performance problem (at least on benchmarks), but it also means the mixed architecture is no longer linear. It will still be faster and less RAM-hungry at long context than a pure transformer, though.

by cubefox

3/21/2026 at 5:58:42 PM

Can anyone explain why Mamba models start with a continuous time SSM (and discretize) vs discrete time?

I know the step isn’t fixed, also not sure why that’s important. Is that the only reason? There also seems to be a parameterization advantage too with the continuous formulation.
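For reference, the standard continuous-time formulation and its zero-order-hold (ZOH) discretization from the earlier Mamba papers (not specific to Mamba-3). The usual motivations: starting from the continuous form lets the step size Δ be a learned, input-dependent quantity, which is what makes the dynamics selective; and parameterizing A in continuous time (e.g. constrained to have negative real part) keeps the discrete transition exp(ΔA) stable by construction for any positive Δ:

```latex
% Continuous-time SSM:
%   h'(t) = A h(t) + B x(t), \qquad y(t) = C h(t)
% ZOH discretization with learned, input-dependent step \Delta:
\begin{aligned}
\bar{A} &= \exp(\Delta A) \\
\bar{B} &= (\Delta A)^{-1}\bigl(\exp(\Delta A) - I\bigr)\,\Delta B \\
h_t &= \bar{A}\, h_{t-1} + \bar{B}\, x_t, \qquad y_t = C\, h_t
\end{aligned}
```

A directly learned discrete recurrence would have to enforce stability and input dependence of the transition matrix some other way, which is (at least part of) the parameterization advantage you mention.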

by roger_

3/21/2026 at 3:56:02 PM

I'm glad I clicked through because I thought the article was about Mamba, the package manager I associate with Python (similar to conda).

https://github.com/mamba-org/mamba

by jeffhwang

3/21/2026 at 6:02:44 PM

This is really promising. Are they now going to scale this up to hundreds of billions of parameters? Why stop at 1.5B if they found a potentially SOTA architecture?

by fudged71

3/21/2026 at 7:32:22 PM

Probably constrained by training resources. It's much easier to experiment with a smaller architecture. You may need many training runs to figure out hyperparameters for example. If each run needs multiple GPUs for a week the cost adds up quickly. I think it makes a lot of sense to start small.

by snek_case

3/21/2026 at 5:56:40 PM

I'm looking forward to the fifth iteration of this model.

by manlymuppet

3/22/2026 at 4:45:34 PM

Mamboooo no. #5

by breadsniffer

3/21/2026 at 2:21:30 PM

[dead]

by diablevv

3/21/2026 at 1:02:30 PM

[dead]

by daliliu

3/21/2026 at 6:09:24 AM

> Mamba-3 is a new state space model (SSM) designed with inference efficiency as the primary goal — a departure from Mamba-2, which optimized for training speed. The key upgrades are a more expressive recurrence formula, complex-valued state tracking, and a MIMO (multi-input, multi-output) variant that boosts accuracy without slowing down decoding.

Why can’t they simply say -

Mamba-3 focuses on being faster and more efficient when making predictions, rather than just being fast to train like Mamba-2.

by robofanatic

3/21/2026 at 6:24:08 AM

This is sort of what their first sentence states? Except your line implies that they are fast in both training and inference; they imply they are focusing on inference and are dropping training speed for it.

It's a nice opening as it is imo

by esquire_900

3/21/2026 at 8:38:15 AM

They don't say anything about dropping training speed.

by cubefox

3/21/2026 at 11:12:23 AM

> a departure from Mamba-2, which optimized for training speed.

?

by estearum

3/21/2026 at 1:43:09 PM

Yes? Mamba-2 optimized for training speed compared to Mamba-1. Mamba-3 adds optimization for inference. These are pretty much version numbers.

by cubefox

3/21/2026 at 5:46:24 PM

Agreed. What you wrote was probably the input, what we see is the LLM output with the directive to "make us sound smart, put gratuitous em-dash"

by i000

3/21/2026 at 6:20:40 AM

The first sentence basically does though, no?

by E-Reverance

3/21/2026 at 7:01:10 AM

Of course my only objection was the language. LLMs are now old enough to leave the jargon behind and talk in simple easy to understand terms.

by robofanatic

3/21/2026 at 8:20:08 AM

I’d argue the opposite, the terminology is fairly mainstream by now and “inference” has a much more specific sense than “making predictions”.

by oersted

3/21/2026 at 7:55:12 AM

The blog is technical, technical terms in the TL;DR seems relevant to me.

by mufasachan

3/21/2026 at 7:52:14 AM

I don't get the downvotes, as I had trouble understanding the intro as well. It seems it was written for a very specific audience.

by arendtio

3/21/2026 at 8:19:38 AM

Yes, it is written for a specific audience.

That is not a reason for snark.

As other commenters have noted, it’s well written.

by qeternity

3/21/2026 at 9:05:50 AM

> I don't get the downvotes

Because the blog post is a technical one and the intro contains very common jargon, and the proposed alternative was wrong.

by magicalhippo

3/21/2026 at 12:18:09 PM

Found the guy who made the Windows error messages say “Your computer did an oopsie :(” instead of including any useful information.

by renewiltord

3/21/2026 at 8:14:56 AM

I don’t know why you’re being downvoted. As a longtime editor, I find your version immensely better. The original was probably not human-written.

by camillomiller

3/21/2026 at 1:51:18 PM

Why would the simpler version be better for a technical audience?

by stavros