Nvidia Cosmos 3

6/1/2026 at 2:05:11 PM

SOTA open source model for image and video generation. It beats all others, but at 64B params it is too big to run on most people’s computers.

Still impressive nonetheless given its artificially generated training sets.

Beats Nano Banana 1 but is not yet competitive with 2 or Seedance 2, Grok Imagine, etc.

by aabdi

6/1/2026 at 6:05:35 PM

It's sadly ironic I no longer even bother clicking on HN posts that are obvious product announcements from large corporations and instead just go to the replies. Corporate product announcements somehow fail to even clearly communicate the basic facts you did in your first nine words.

One nuance that's missing from your summary is it's a world model specifically targeted to be useful for training robotic and autonomous vehicle AIs. So not really intended to be a direct competitor to Nano Banana or Seedance. While it can do straight image and video gen, its special sauce is providing more physics data and harnesses for AI training scenarios.

by mrandish

6/1/2026 at 2:41:59 PM

Great summary. I find image and video generation models are a more understandable reality check for how close local models are to frontier models.

by xnx

6/1/2026 at 2:51:29 PM

> Cosmos 3 Nano is the compact version with 16B parameters and optimized for efficient inference. It’s designed to run on workstation-grade compute, like the NVIDIA RTX PRO 6000 GPU for real-time robotics inference and physical AI applications.

Looking forward to trying this out on my $10,000+ workstation-grade GPU that I need an equally expensive setup to run.

by darth_avocado
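
A rough way to see why the parameter counts quoted in this thread (64B for the full model, 16B for Nano) translate into hardware requirements is to multiply parameters by bytes per parameter. The sketch below is only that arithmetic: it counts weights alone and ignores activations, caches, and framework overhead, and the precision table and function name are illustrative assumptions rather than anything from the Cosmos release.

```python
# Back-of-the-envelope weight memory: parameters x bytes per parameter.
# Weights only; activations, caches, and runtime overhead are NOT included.

# Assumed storage cost per parameter for common precisions.
BYTES_PER_PARAM = {"fp32": 4.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params: float, precision: str) -> float:
    """Return approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


if __name__ == "__main__":
    # Parameter counts as quoted in the thread.
    for name, params in [("64B model", 64e9), ("16B model", 16e9)]:
        for precision in BYTES_PER_PARAM:
            gb = weight_memory_gb(params, precision)
            print(f"{name} @ {precision}: {gb:.0f} GB")
```

Under these assumptions the 64B weights alone come to about 128 GB at 16-bit precision, against about 32 GB for the 16B variant.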

6/1/2026 at 4:39:18 PM

I have the GPU but no robot. What’s the minimum viable robot needed to play with this?

by Gracana

6/1/2026 at 8:12:06 PM

Not at all an expert but I believe it's possible to get started experimenting with just a simulated robot in the simulated world model. While the full workflow is to generate training data to drive a real robot in the real world, without closing the loop, you're just lacking the ground truth data to quantify the divergence between simulation and reality.

There are all kinds of hobbyist robotic armatures at various price points but my understanding from a friend in this space is that the precision, durability and repeatability for serious applications starts at around $30,000 to $50,000. He mentioned the Franka Research 3 (FR3) as one example (https://franka.de/), perhaps driven by something like a Jetson AGX Thor ($5,000 and up).

As always, there are many less expensive and DIY-ish recipes to get started on smaller budgets. My friend's suggestion was more the baseline experimental lab system for a big company wanting to get started with something that could, in theory, scale to light industrial internal deployment.

by mrandish
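
To make "quantify the divergence between simulation and reality" from the comment above concrete: once ground-truth measurements from a real robot are available, one simple option is a root-mean-square error between paired trajectory samples. This is a minimal sketch under the assumption that both trajectories are sampled at the same timestamps; the function name and the example positions are hypothetical, and real evaluations would likely use richer metrics.

```python
import math
from typing import Sequence

Point = Sequence[float]


def rmse(simulated: Sequence[Point], measured: Sequence[Point]) -> float:
    """Root-mean-square Euclidean distance between paired trajectory samples."""
    if len(simulated) != len(measured) or not simulated:
        raise ValueError("trajectories must be non-empty and equally long")
    # Squared Euclidean distance for each pair of corresponding samples.
    squared = [
        sum((s_i - m_i) ** 2 for s_i, m_i in zip(s, m))
        for s, m in zip(simulated, measured)
    ]
    return math.sqrt(sum(squared) / len(squared))


if __name__ == "__main__":
    # Hypothetical end-effector positions in metres at matching timestamps.
    sim = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.0)]
    real = [(0.0, 0.0), (0.1, 0.01), (0.2, 0.03)]
    print(f"sim-to-real RMSE: {rmse(sim, real):.4f} m")
```

Without the measured trajectory there is nothing to pass as the second argument, which is the "missing ground truth" the comment describes.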

6/1/2026 at 3:52:18 PM

Good news, Nvidia will happily sell you one of their new RTX Spark laptops to run this.

by thewebguyd

6/1/2026 at 3:59:18 PM

  This release unifies those capabilities with a Mixture-of-Transformers (MoT) architecture built around two towers. 
  Reasoner tower: A vision-language model (VLM) ... This serves as the ‘brain’ that reasons about the world before any generation happens.
  Generator tower: Generates future observations and action sequences. This tower uses a diffusion-based process to generate physics-aware video and action outputs that are conditioned on the reasoner tower’s understanding.

This sort of approach (and others I've seen like it) always appeals to my inner engineer, trying to optimize and balance tradeoffs between model architectures and combine two things to yield the best of both worlds.

But based on my understanding of the Bitter Lesson (http://www.incompleteideas.net/IncIdeas/BitterLesson.html), this is precisely the wrong approach in the long term. I'm linking the actual text of the Bitter Lesson because I think it's misunderstood (or I just don't agree with how I've seen it used in discourse). Specifically:

  The bitter lesson is based on the historical observations that 1) AI researchers have often tried to build knowledge into their agents, 2) this always helps in the short term, and is personally satisfying to the researcher, but 3) in the long run it plateaus and even inhibits further progress, and 4) breakthrough progress eventually arrives by an opposing approach based on scaling computation by search and learning. The eventual success is tinged with bitterness, and often incompletely digested, because it is success over a favored, human-centric approach.

This architecture feels specifically like "trying to build knowledge into the agent that will help in the short term" but will plateau long term. That's not to say that there won't be some interesting learnings or things built on top of it, but I doubt that there's a lot of juice to squeeze with this kind of approach IMO.

by mangoman

6/1/2026 at 4:16:59 PM

This feels like the opposite to me? The MoT architecture looks like the ideal that the Bitter Lesson alludes to - just take all of your data in all of your formats (audio, image, text, action, video) and dump it all into a single shared latent space. Then let the model sort things out, with just enough structure to handle the different requirements/output formats needed (e.g. autoregressive stuff for sequence modeling/prediction, diffusion stuff for generation).

by 3PS

6/1/2026 at 5:06:48 PM

This is mostly a decompression, it’s fairly standard nowadays. The point is to get the data from the internal compressed version into the human usable version.

We can technically reason at pixel or char level encodings but it’s going to be much more expensive generally. Think of the overall technique as a way to get computer go faster.

You see it with Qwen talker, most multimodal projectors, etc

by aabdi

6/1/2026 at 4:54:25 PM

Except this model has a broader domain than text-LLM models. More than the old omni models too since it takes video input. The architecture is exotic but I don't see tuning here that is more extreme than open models released every day.

by samuelknight

6/1/2026 at 3:49:51 PM

The warehouse safety video example is really funny, because the people don't react at all.

by BugsJustFindMe

6/1/2026 at 4:23:41 PM

The car video is silly as well, the crossing van clearly runs a red light. The big shadow of the light pole in the intersection also makes no sense...

by sqeak

6/1/2026 at 6:55:50 PM

I feel like the car use case demonstrates that these models are not really useful for the cutting edge: They produce exactly the kind of in-domain data that already exists in droves. What is needed, and what Tesla collects, are the edge cases!

(Now for a startup with zero data, this is of course still useful)

by ThouYS

6/1/2026 at 4:52:56 PM

Cars run red lights in real life. Driving defensively requires anticipating it. Anyone expecting them not to is more likely to get in a crash.

The rest I can't speak to.

by timschmidt

6/1/2026 at 9:21:02 PM

The two-tower Mixture-of-Transformers design (autoregressive reasoner feeding a diffusion generator) is an interesting architectural bet.

by ramaseshanms

6/1/2026 at 2:20:16 PM

I'm struggling to understand what this does.

> Generates future observations and action sequences.

Is that just a complicated way of saying video gen?

by causal

6/1/2026 at 8:12:03 PM

No, the "action" part is the distinction. Their world model is conditioned on robot actions, for example, which gives you two things that video gen alone can't: predicting the future frames that follow a given action (change the action, get a different future from the same starting frame), and running it in reverse to infer the actions behind observed frames or output the actions needed to hit a goal (the output is motor commands and not video frames).

by protortyp
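
The forward and reverse directions described in the comment above can be illustrated with a toy stand-in. In this sketch a hand-written grid rule plays the role of the learned model, so it says nothing about how Cosmos is actually implemented; every name here (`predict_next`, `infer_action`, `plan`, the four-action vocabulary) is hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Optional, Tuple

State = Tuple[int, int]  # toy "observation": (x, y) position on a grid

# Toy action vocabulary; a real system would emit continuous motor commands.
ACTIONS: Dict[str, Tuple[int, int]] = {
    "left": (-1, 0), "right": (1, 0), "up": (0, 1), "down": (0, -1),
}


def predict_next(state: State, action: str) -> State:
    """Forward direction: the future observation that follows a given action."""
    dx, dy = ACTIONS[action]
    return (state[0] + dx, state[1] + dy)


def infer_action(before: State, after: State) -> Optional[str]:
    """Inverse direction: which action explains an observed transition?"""
    for action in ACTIONS:
        if predict_next(before, action) == after:
            return action
    return None  # no single action explains this transition


def plan(start: State, goal: State, max_steps: int = 20) -> List[str]:
    """Goal-directed use: greedily pick the action whose predicted
    outcome is closest (Manhattan distance) to the goal."""
    def distance(s: State) -> int:
        return abs(s[0] - goal[0]) + abs(s[1] - goal[1])

    state, actions = start, []
    while state != goal and len(actions) < max_steps:
        best = min(ACTIONS, key=lambda a: distance(predict_next(state, a)))
        actions.append(best)
        state = predict_next(state, best)
    return actions


if __name__ == "__main__":
    print(predict_next((0, 0), "up"))    # same start, different action...
    print(predict_next((0, 0), "left"))  # ...different predicted future
    print(infer_action((0, 0), (1, 0)))
    print(plan((0, 0), (2, 1)))
```

The point of the toy is the interface: the same model answers "what happens if I do this?", "what was done here?", and "what should I do to get there?", and only the first of those resembles plain video generation.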

6/1/2026 at 2:25:21 PM

As I understand it, they mean both computer vision and video gen, linked by a pretty robust world model. One of their hosted examples is purely analysing an existing video, the other is predicting (i.e. video gen) from a static image to a video.

by swiftcoder

6/1/2026 at 6:47:41 PM

If I were to hallucinate what it is and why it's worded that way: the AI robot space is in need of a hyper-realistic game engine with better physics than Unity/Unreal-style non-deformable rigid body mechanics, one that's also way faster than 1x, completely unlike engineering FEM sims, and this caters to that need.

by numpad0

6/1/2026 at 2:36:34 PM

Look at the table of supported modalities. It can take in input of image/video/text/actions and output image/video/text/actions.

by derac

6/1/2026 at 3:22:44 PM

That just raises more questions. What kind of "observation or action" does image input generate? What is an action output if it's not text?

by causal

6/1/2026 at 3:44:17 PM

It can be used to generate synthetic data to train physical AI for robots, cars, drones, etc. The world can be simulated from a first-person perspective to generate training data without sending robots to people's homes.

by heliosAtwork

6/1/2026 at 3:05:40 PM

You can fine-tune it so, given an image and a task description, it generates a corresponding set of actions.

by ainch

6/1/2026 at 3:51:42 PM

Most of the examples they've chosen seem.. not good? What an odd mix of bad game engine and AI slop. I can't imagine that this stuff makes good training data for real-world applications.

by sosodev

6/1/2026 at 8:54:06 PM

These demos honestly look pretty good to me. But it is objectively true that this and similar technologies are used at huge scale by every leading autonomous vehicle manufacturer, so we can inductively reason that it _is_ good enough for that use-case. I don't work on Cosmos, but I am currently working on a superficially similar non-open technology at Nvidia used by many of these leaders which, in my opinion, produces similar quality. Some of the open research for it is here:

https://github.com/nv-tlabs/3dgrut/

https://github.com/NVIDIA/harmonizer

https://github.com/NVIDIA/instant-nurec

https://github.com/nvidia/ncore

Nvidia also is integrating Gsplat into at least what I work on and contributing upstream.

https://github.com/nerfstudio-project/gsplat

by Conscat

6/1/2026 at 5:44:35 PM

It is funny that after all their tech advancements, the site is struggling under heavy load.

by cesarvarela

6/1/2026 at 4:03:04 PM

[flagged]

by overfits-ai

6/1/2026 at 1:48:36 PM

[flagged]

by kushagra1211