π0.5: A VLA with open-world generalization

4/22/2025 at 9:32:57 PM

Most of it is open source. Their VLAs are based on Gemma models plus vision encoders, plus their own action experts. You can download and play around with or fine-tune their Pi0 VLAs, either from their servers directly (JAX format) or from the Hugging Face LeRobot safetensors port. They also have notebooks and code in their repo to get started with fine-tuning. Inference runs on a single RTX 4090, streamed over WiFi to the robot.

by bytesandbits

4/22/2025 at 10:10:03 PM

OpenAI is among their investors, which makes me wonder how long their work remains "open".

by amelius

4/22/2025 at 6:06:47 PM

This is amazing! As someone working with industrial robots, normally under strict environmental constraints and control, witnessing such real-world robotics progress truly excites me about the future!

By the way, they’ve open-sourced their π0 model (code and model weights). More information can be found here: https://github.com/Physical-Intelligence/openpi

by beklein

4/22/2025 at 6:17:34 PM

It seems robotics has advanced more in the last 3 years than the previous 20.

by UltraSane

4/23/2025 at 12:30:07 AM

the torrent of funding helps here

by htrp

4/23/2025 at 10:40:02 AM

The vision-language-action models and the two-level LLMs (slow planning, fast control) seem to be a big breakthrough.

by UltraSane

4/23/2025 at 6:02:06 AM

ML helps here, as does the progress Nvidia has made with their robotics platform.

by Tireings

4/23/2025 at 11:53:41 AM

But mostly OpenCV, in its excellent C++ and Python variants. Not everything is modern ML heuristics; some classic AI is still needed.

by rurban

4/22/2025 at 6:40:53 PM

I'm genuinely asking (not trying to be snarky)... Why are these robots so slow?

Is it a throughput constraint given too much data from the environment sensors?

Is it processing the data?

I'm curious about where the bottleneck is.

by djoldman

4/23/2025 at 1:18:58 AM

It is inference latency most of the time. These VLA models take in an image + state + text and spit out a set of joint angle deltas.

Depending on the model being used, we may get just one set of joint angle deltas or a series of them. In order to be able to complete a task, it will need to capture images from the cameras, current joint angles and send them to the model along with the task text to get the joint angle changes we will need to apply. Once the joint angles are updated, we will need to check if the task is complete (this can come from the model too). We run this loop till the task is complete.

Combining this with the motion planning that has to happen, to make sure the joint angles we are getting are safe and do not result in colliding with the surroundings, results in overall slowness.
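The loop described above can be sketched roughly as follows. This is a minimal illustration, not Pi0's actual code: `Observation`, `run_task`, `toy_policy`, and the stubbed camera are all hypothetical names, and the toy policy just nudges two joints toward a fixed target instead of running a real model.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Observation:
    image: bytes               # camera frame (placeholder type)
    joint_angles: List[float]  # current joint angles, radians
    task: str                  # task text, e.g. "put the cup in the sink"


# Hypothetical policy interface: observation -> (joint angle deltas, task done?)
Policy = Callable[[Observation], Tuple[List[float], bool]]


def run_task(policy: Policy, read_camera: Callable[[], bytes],
             start_angles: List[float], task: str,
             max_steps: int = 1000) -> Tuple[List[float], int]:
    """Observe, infer, apply deltas, and repeat until the policy reports completion."""
    angles = list(start_angles)
    for step in range(1, max_steps + 1):
        obs = Observation(read_camera(), list(angles), task)
        deltas, done = policy(obs)  # model inference: usually the slow part
        angles = [a + d for a, d in zip(angles, deltas)]
        if done:
            return angles, step
    return angles, max_steps


def toy_policy(obs: Observation) -> Tuple[List[float], bool]:
    """Stand-in for a VLA: step each joint at most 0.1 rad toward a fixed target."""
    target = [0.5, -0.2]
    deltas = [max(-0.1, min(0.1, t - a)) for t, a in zip(target, obs.joint_angles)]
    done = all(abs(t - (a + d)) < 1e-6
               for t, a, d in zip(target, obs.joint_angles, deltas))
    return deltas, done


final_angles, steps = run_task(toy_policy, lambda: b"", [0.0, 0.0], "reach target")
```

Every pass through the loop pays for one camera read and one inference call, so per-step model latency directly bounds how fast the arm can be commanded.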

by ajhai

4/22/2025 at 7:33:52 PM

Not a PI employee, but diffusion policies are like diffusion models for image generation, they generate actions from noise in multiple steps. With current compute you can't run 100+Hz control loops with that kind of architecture.

Some combination of distillation, new architectures, faster compute, can eventually attack these problems. Historically as long as something in tech has been shown to be possible, speed has almost always been a non-issue in the years afterwards.

For now even getting a robot to understand what to do in the physical world is a major leap from before.
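To make the "actions from noise in multiple steps" point concrete, here is a toy sketch. It is not a real diffusion policy: the stand-in `denoise_step` is told the clean action, whereas a real denoiser is a neural network conditioned on the observation. All names and the 10 ms per-step latency are assumptions for illustration.

```python
import random
from typing import List, Tuple


def denoise_step(action: List[float], clean: List[float],
                 k: int, num_steps: int) -> List[float]:
    """Stand-in for one network forward pass: move part of the way to the clean action."""
    remaining = num_steps - k
    return [a + (c - a) / remaining for a, c in zip(action, clean)]


def sample_action(clean: List[float], num_steps: int,
                  rng: random.Random) -> Tuple[List[float], int]:
    """Start from Gaussian noise and refine it over num_steps sequential passes."""
    action = [rng.gauss(0.0, 1.0) for _ in clean]  # pure noise
    calls = 0
    for k in range(num_steps):
        action = denoise_step(action, clean, k, num_steps)
        calls += 1  # each step costs one forward pass in a real model
    return action, calls


def max_rate_hz(num_steps: int, step_latency_s: float) -> float:
    """Upper bound on samples per second, assuming one action per sample."""
    return 1.0 / (num_steps * step_latency_s)


action, calls = sample_action([0.3, -0.1], num_steps=10, rng=random.Random(0))
rate = max_rate_hz(num_steps=10, step_latency_s=0.01)  # hypothetical 10 ms per step
```

Because the denoising steps run sequentially, the cost of one sample grows linearly with the number of steps, which is why distillation to fewer steps is one of the obvious levers.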

by dheera

4/23/2025 at 12:22:49 AM

That's not the reason. Pi0 was basically predicting at 10 Hz, and each prediction was a temporal chunk of up to 50 points, so it could go up to 500 Hz.
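The arithmetic behind that figure, as a small sketch (the function names are made up for illustration; only the 10 Hz and 50-point numbers come from the comment):

```python
from typing import List


def effective_rate_hz(inference_hz: float, chunk_len: int) -> float:
    """Each inference call yields chunk_len actions, so the command rate multiplies."""
    return inference_hz * chunk_len


def playback_times(chunk_start_s: float, inference_hz: float,
                   chunk_len: int) -> List[float]:
    """Timestamps at which the actions of one chunk would be sent to the robot."""
    dt = 1.0 / effective_rate_hz(inference_hz, chunk_len)
    return [chunk_start_s + i * dt for i in range(chunk_len)]


rate = effective_rate_hz(10, 50)      # 10 chunks per second, 50 actions each
times = playback_times(0.0, 10, 50)   # one chunk spans 0.1 s at 2 ms spacing
```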

It's slow because the original teleop is slow, and controllers learned through imitation learning are always a bit slower.

Source: I work on this (not at PI)

by davidguetta

4/23/2025 at 5:07:06 AM

Another practical reason is that it's dangerous.

Pi0 uses an ARX robot arm, which weighs 3-4 kg per arm. It can easily break things or harm people if you allow it to move fast.

by cloudbonsai

4/23/2025 at 10:37:55 AM

Not really if you clamp the torque aggressively.
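A minimal sketch of what clamping means here, assuming per-joint symmetric limits. The function name and the 2 N·m limits are hypothetical, and real arms usually enforce this in the motor driver as well as in software.

```python
from typing import List


def clamp_torques(torques: List[float], limits: List[float]) -> List[float]:
    """Saturate each commanded joint torque to the range [-limit, +limit]."""
    return [max(-lim, min(lim, t)) for t, lim in zip(torques, limits)]


# Hypothetical commands and limits, in N·m
safe = clamp_torques([5.0, -3.0, 0.5], [2.0, 2.0, 2.0])
```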

But yeah in general the physical world is more dangerous than we tend to think

by davidguetta

4/22/2025 at 10:05:09 PM

You're probably right, but for some tasks I suppose you need processing speed, for example bipedal walking.

by amelius

4/22/2025 at 8:10:50 PM

When you're operating your robot around humans, you want to be very confident it won't injure anyone. It'd be pretty bad if a bug in your code meant instead of putting the cast iron frying pan in the dishwasher, it sent it flying across the room.

One way of doing that is to write code with no bugs or unpredictable behaviour, a nigh-impossible feat - especially once you've got ML models in the mix.

Another option is to put a guard cage around your robot so nobody can enter pan-throwing distance without deactivating the robot first. But obviously that's not practical in a home environment.

Another option is just to go slowly all the time. The pan won't fly very far if the robot only moves 6 inches per second.
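A back-of-the-envelope check of that claim, assuming the pan is released horizontally at the gripper's speed from counter height and ignoring air resistance. The 1 m release height and the function name are assumptions, not figures from the demo.

```python
import math

INCH_M = 0.0254  # metres per inch


def throw_distance_m(speed_m_s: float, release_height_m: float,
                     g: float = 9.81) -> float:
    """Horizontal distance travelled by an object released horizontally."""
    fall_time_s = math.sqrt(2.0 * release_height_m / g)
    return speed_m_s * fall_time_s


slow = throw_distance_m(6 * INCH_M, release_height_m=1.0)  # 6 in/s: about 7 cm
fast = throw_distance_m(2.0, release_height_m=1.0)         # 2 m/s: about 0.9 m
```

Distance scales linearly with release speed, so capping speed caps how far anything can be flung.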

by michaelt

4/22/2025 at 11:12:47 PM

Putting the cast iron frying pan in the dishwasher would also be pretty bad.

by reverius42

4/23/2025 at 12:11:55 AM

Maybe this robot is satisfying a rust production utility function. Don't be so bioist. All utility functions are beautiful.

by idiotsecant

4/23/2025 at 2:18:14 AM

Annoying perhaps. But not bad.

by jagged-chisel

4/22/2025 at 6:49:15 PM

Part of it is that training of these VLAs currently happens on human teleop data which limits speed (both for safety reasons and because of actual physical speed constraints in the teleoperation pipeline).

Let’s see how it changes once these pipelines follow the LLM recipes to use more than just human data…

by robopolicy

4/22/2025 at 8:44:48 PM

The primary bottleneck is typically the motion planning system that must continuously solve complex optimization problems to ensure safe trajectories while avoiding collisions in dynamic environments.

by ethan_smith

4/22/2025 at 9:12:34 PM

These models typically predict actions directly, there is no motion planning going on here.

by vhartman

4/22/2025 at 8:01:30 PM

Amazing! On a fun note, I believe if a human kid were cleaning up the spill and threw the sponge into the sink like that, the kid would be in trouble. XD

by huydotnet

4/24/2025 at 9:12:49 AM

cleaning a spill consists mostly of spreading it over the whole counter

by scotty79

4/22/2025 at 6:32:19 PM

These variable-length arrays are getting quite advanced

by meisel

4/22/2025 at 8:56:51 PM

Precisely my thoughts.

by layer8

4/22/2025 at 6:46:59 PM

Ignore the haters. This is hilarious

by matthewfcarlson

4/22/2025 at 6:08:32 PM

Is the robot platform they're using something they've developed themselves? The paper doesn't seem to mention any details outside of sensors and actuators.

by gs17

4/22/2025 at 6:10:20 PM

Off-the-shelf robots -- we've got our models running on a dozen+ different robot types (and have this specific generalization demo working on multiple platforms too).

by lachyg

4/22/2025 at 6:16:46 PM

Great, would you happen to know what's used in this video?

by gs17

4/22/2025 at 6:20:41 PM

Here are some of the suppliers for things seen in the videos:

https://arx-x.com/

https://x.com/GalaxeaDynamics

https://www.youtube.com/@HEXMOVEHexmove_Robotic

https://www.trossenrobotics.com/

by modeless

4/23/2025 at 2:52:14 PM

So you are saying I can buy some robot, GPU and have this robot fold my laundry? How much? :D

by npodbielski

4/22/2025 at 8:05:23 PM

Do the general laws of demos apply here? That any automation shown is the extent of the capabilities, not the start?

by th0ma5

4/22/2025 at 8:45:18 PM

One thing I notice is that they specify that the robot has never seen the homes before, but certain objects, like the laundry baskets, are identical.

Doing your demo is significantly easier if you've already programmed/trained the robot to recognize the specific objects it has to interact with, even if those items are in different locations.

by fwip

4/22/2025 at 9:14:54 PM

They also got these things working in corners of a location instead of stacking tasks on different areas of the same location. And even on these "one-area" task groups it can fail a good amount. Kudos to them for showing the failures though.

by horhay

4/24/2025 at 10:46:36 AM

isn't object recognition essentially solved? AI models were beating humans at image classification (in terms of error rate) back in 2016. even if this particular model isn't the best at it, they can always call out to an API or have a secondary on-device VLM that has stronger object recognition capabilities

by dissahc

4/22/2025 at 10:50:47 PM

Thank you all. I guess the answer is yes.

by th0ma5

4/22/2025 at 11:23:23 PM

VLA = vision-language-action, a kind of machine learning model

by yencabulator

4/22/2025 at 7:09:19 PM

I'm just a layman, but I can't see this design scaling. It's way too slow and "hard" for fine motor tasks like cleaning up a kitchen or being anywhere around humans, really.

I think the future is in "softer" type of robots that can sense whether their robot fingers are pushing a cabinet door (or if it's facing resistance) and adjust accordingly. A quick google search shows this example (animated render) which is closer to what I imagine the ultimate solution will be: https://compliance-robotics.com/compliance-industry/

Human flesh is way too squishy for us to allow hard tools to interface with it, unless the human is in control. The difference between a blunt weapon and the robot from TFA is that the latter is very slow and on wheels.

by airstrike

4/22/2025 at 8:35:23 PM

The development here is primarily in the model. If someone invents the 'brains' a robot needs to do useful domestic tasks then there will suddenly be a lot of incentive to build the right body for it.

by nullc

4/23/2025 at 12:22:27 AM

Right, but ISTM that building the right body is a much harder problem than people are willing to admit.

Software isn't constrained by the harsh truths of physical reality.

by airstrike

4/22/2025 at 8:48:37 PM

Finally, machines doing the work we don't want to do

by desertmonad

4/23/2025 at 1:20:52 AM

https://x.com/ajhai/status/1899528923303809217 something I have been working on for a few months now.

by ajhai

4/23/2025 at 3:49:52 AM

> Investors
> We are grateful for the support of Bond, Jeff Bezos, Khosla Ventures, Lux Capital, OpenAI, Redpoint Ventures, Sequoia Capital, and Thrive Capital.

by zx8080