Yesterday, I was watching some of the YouTube videos on the website of a robotics company, https://www.figure.ai, that challenges some of the points in this article a bit. They have a nice robot prototype that (assuming these demos aren't faked) does fairly complicated things. One of the key features they showcase is using OpenAI's AI for the human-computer interaction and reasoning.
While these things seem a bit slow, they do get things done. They have a cool demo of a human interacting with one of the prototypes, asking it what it thinks needs to be done and then asking it to do those things. That showcases reasoning, planning, and machine vision, which are exactly the topics all the big LLM companies are working on.
They appear to be using an agentic approach similar to how LLMs are currently being integrated into other software products. Honestly, it doesn't even look like they are doing much that isn't part of OpenAI's APIs, which is impressive. I saw speech capabilities, reasoning, visual inputs, function calls, etc. in action, including the dreaded "thinking" pause where the robot waits a few seconds for the remote GPUs to do their thing.
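Roughly, it looks like the kind of loop you could build yourself on top of their public API: send a camera frame plus the user's request, and let the model call a "tool" that maps to one of the robot's primitives. The sketch below is just my guess at the shape of it; the tool name and the whole setup are invented for illustration and obviously not how figure.ai actually wired it up.

```python
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# One hypothetical robot primitive exposed as a tool; name and schema invented.
tools = [{
    "type": "function",
    "function": {
        "name": "pick_and_place",
        "description": "Pick up a named object and place it at a named location.",
        "parameters": {
            "type": "object",
            "properties": {
                "object": {"type": "string"},
                "target": {"type": "string"},
            },
            "required": ["object", "target"],
        },
    },
}]

with open("camera_frame.jpg", "rb") as f:
    frame = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What do you think needs to be done here? Do it."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{frame}"}},
        ],
    }],
    tools=tools,
)

# The "thinking" pause happens here: the robot waits for the remote GPUs,
# then dispatches whatever tool calls come back to its own control stack.
for call in response.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)
```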
This is not about fine motor control but about replacing humans controlling robots with LLMs controlling robots and getting similarly good/ok results. As the article argues, the hardware is actually not perfect but good enough for a lot of tasks if it is controlled by a human. The hardware in this video is nothing special. Multiple companies have similar or better prototypes. Dexterity and balance are alright but probably not best in class. Best in class hardware is not the point of these demos.
Dexterity and real-time feedback are less important than the reasoning and classification capabilities people have. The latency just means things go a bit slower. Watching these things shuffle around like an old person that needs to go to the bathroom is a bit painful. But getting from A to B seems like a solved problem. A 2 or 3x speedup would be nice. 10x would be impressively fast. 100x would be scary and intimidating to have near you. I don't think that's going to be a challenge long term. Making LLMs faster is an easier problem than making them smarter.
Putting a coffee cup in a coffee machine (one of the demo videos) and then learning to fix it when it misaligns seems like an impressive capability. It compensates for precision and speed with adaptability and reasoning: analyze the camera input, correctly assess the situation, problem, and challenge, come up with a plan to perform the task, execute the plan, re-evaluate, adapt, fix. It's a bit clumsy, but the end result is coffee. Good demo, and I can see how you might make it do all sorts of things that are vaguely useful that way.
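If I had to sketch that loop in code, it would be something like this; the capture/plan/execute/check functions are placeholders for the camera, the LLM call, and the low-level controllers, not anything Figure has published:

```python
def run_task(goal, capture_frame, plan_step, execute, task_done, max_steps=20):
    """Plan/act/re-evaluate loop: trades speed for adaptability, as in the demo."""
    for _ in range(max_steps):
        frame = capture_frame()               # analyze the camera input
        action = plan_step(goal, frame)       # LLM proposes the next action
        execute(action)                       # hand off to low-level control
        if task_done(goal, capture_frame()):  # re-evaluate; loop again to adapt/fix
            return True
    return False  # give up and ask a human, or re-plan from scratch
```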
The key point here is that knowing that the thing in front of the robot is a coffee cup and a coffee machine, identifying how those things fit together, and knowing in what context that is required are all things that LLMs can do.
Better feedback loops and hardware will make this faster and less tedious to watch. Faster LLMs will help with that too. And better LLMs will result in fewer mistakes, better plans, etc. Both capabilities seem to be improving at an enormously fast pace right now.
And a fine point about human intelligence is that we divide and conquer. Juggling is a lot harder when you start thinking about it. The thinking parts of your brain interfere with the lower-level neural circuits involved with juggling. You'll drop the balls. The whole point of juggling is that you need to act faster than you can think. Like LLMs, we're too slow. But we can still learn to juggle. Juggling robots are going to be a thing.
1/11/2025 at 9:19:12 AM
> The key point here is that knowing that the thing in front of the robot is a coffee cup and a coffee machine and identifying how those things fit together and in what context that is required are all things that LLMs can do.

I'm skeptical that any LLM "knows" any such thing. It's a Chinese Room. It's got a probability map that connects the lexemes (to us) 'coffee machine' and 'coffee cup' depending on other inputs that we do not and cannot access, and spits out sentences or images that (often) look right, but that does not equate to any understanding of what it is doing.
As I was writing this, I took ChatGPT-4 for a spin. When I ask it about an obscure but once-popular fantasy character from the 70s, cold, it admits it doesn't know. But if I ask it about that same character after first asking about some obscure fantasy RPG characters, it cheerfully confabulates an authoritative and wrong answer. As always, if it does this on topics where I am a domain expert, I consider it absolutely untrustworthy for any topics on which I am not a domain expert. That anyone treats it otherwise seems like a baffling new form of Gell-Mann amnesia.
And for the record, when I asked ChatGPT-4, cold, "What is Gell-Mann amnesia?" it gave a multi-paragraph, broadly accurate description, with the following first paragraph:
"The Gell-Mann amnesia effect is a term coined by physicist Murray Gell-Mann. It refers to the phenomenon where people, particularly those who are knowledgeable in a specific field, read or encounter inaccurate information in the media, but then forget or dismiss it when it pertains to other topics outside their area of expertise. The term highlights the paradox where readers recognize the flaws in reporting when it’s something they are familiar with, yet trust the same source on topics outside their knowledge, even though similar inaccuracies may be present."
Those who are familiar with the term have likely already spotted the problem:
"a term coined by physicist Murray Gell-Mann". The term was coined by author Michael Crichton.[1] To paraphrase H.L. Mencken, for every moderately complex question, there is an LLM answer that is clear, simple, and wrong.
1. https://en.wikipedia.org/wiki/Michael_Crichton#Gell-Mann_amn...
by GolfPopper
1/11/2025 at 12:38:45 PM
Do we know how human understanding works? It could be just statistical mapping as you have framed it. You can't say LLMs don't understand when you don't have a measurable definition for understanding. Also, humans hallucinate/confabulate all the time. LLMs even forget in the same way humans do (strong recall at the start and end of the text but weaker in the middle).
by redlock
1/11/2025 at 10:50:49 AM
Hallucinations are a well-known problem. And there are some mitigations that work pretty well. Mostly, with enough context and prompt engineering, LLMs can be pretty reliable. And obscure popular fiction trivia is maybe not that relevant for every use case, which in this case would be robotics; not the finer points of Michael Crichton-related trivia.

You were testing its knowledge, not its ability to reason or classify things it sees. I asked the same question to perplexity.ai. If you use the free version, it uses less advanced LLMs, but it compensates with prompt engineering and by making the model do a search to come up with this answer:
> The Gell-Mann Amnesia effect is a psychological phenomenon that describes people's tendency to trust media reports on unfamiliar topics despite recognizing inaccuracies in articles about subjects they know well. This effect, coined by novelist Michael Crichton, highlights a cognitive bias in how we consume news and information.
Sounds good to me. And it got me a nice reference to something called the portal wiki, and another one for the same Wikipedia article you cited. And a few more references. And it goes on a bit to explain how it works. I get your finer point here that I shouldn't believe everything I read. Luckily, my supervisor worked hard to train that out of me when I was doing a Ph.D. back in the day. But fair point and well made.
Anyway, this is a good example of how to mitigate hallucination with this specific question (and similar ones). Kind of the use case perplexity.ai was made to solve. I use it a lot. In my experience it does a great job figuring out the right references and extracting information from those. It can even address some fairly detailed questions. But especially on the freemium plan, you will run into limitations related to reasoning with what it extracts (you can pay them to use better models). And it helps to click on the links it provides to double check.
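For what it's worth, the search-then-answer pattern these tools use is simple enough to sketch. In the snippet below, web_search is a placeholder for whatever retrieval you have, and the prompt is just an example, not perplexity.ai's actual pipeline:

```python
from openai import OpenAI

client = OpenAI()

def grounded_answer(question, web_search):
    # web_search is a placeholder: it should return snippets with their URLs.
    sources = web_search(question)
    context = "\n\n".join(f"[{i + 1}] {s['url']}\n{s['text']}"
                          for i, s in enumerate(sources))
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system",
             "content": "Answer using only the numbered sources below and cite "
                        "them like [1]. If they don't cover the question, say so."},
            {"role": "user", "content": f"{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content
```

Grounding the answer in retrieved sources (and citing them so you can check) is most of what keeps these tools from confabulating on trivia like this.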
For things that involve reasoning (like coding), I use different tools. Different topic so won't bore you with that.
But what figure.ai is doing falls well within the scope of several things OpenAI does very well that you can use via their API. It's not going to be perfect for everything. But there probably is a lot that it nails without too much effort. I've done some things with their APIs that worked fairly well, at least.
by jillesvangurp
1/11/2025 at 5:10:02 PM
>> Good demo and I can see how you might make it do all sorts of things that are vaguely useful that way.

Unfortunately, since that's a demo, you have most likely seen all the sorts of things that are vaguely useful and that can be done easily, or at all.
Edit: Btw, the coffee task video says that the "AI" is "end-to-end neural networks". If I understand correctly, that means an LLM was not involved in carrying out the task. At most, an LLM may have been used to trigger the activation of the task, which was learned by a different method, probably some kind of imitation learning with deep RL.
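For readers unfamiliar with the term: imitation learning in its simplest form (behaviour cloning) just regresses recorded expert actions from recorded observations, roughly like the sketch below. The feature and action sizes are made up, and whatever Figure actually trained is certainly more involved than this; this is only to show what "learned from demonstrations" means.

```python
import torch
import torch.nn as nn

# Tiny stand-in policy: 512-dim visual features in, 7 joint targets out
# (both numbers are arbitrary, for illustration only).
policy = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 7))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4)

def train_step(obs_features, expert_actions):
    # obs_features: encoded camera frames; expert_actions: joint targets
    # recorded while a human teleoperated the robot through the task.
    loss = nn.functional.mse_loss(policy(obs_features), expert_actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```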
Also, to see how much of a tech demo this is: the robot starts already in position in front of a clear desk and a human brings the coffee machine, positions it just so, places the cup in the holder and places a single coffee pod just so. Then the robot takes the coffee pod from the empty desk and places it in the machine, then pushes the button. That's all the interaction of the robot with the machine. The human collects the cup and makes a thumbs up.
Consider for a moment how different this laboratory instance of the task is from any real-world instance. In my kitchen the coffee machine is on a cluttered surface with tins of coffee, a toaster, sometimes the group left on the machine, etc., and I don't even use coffee pods but loose coffee. The robot you see has been trained to put that one pod, placed in that particular spot, in that one machine, placed just so in front of it. It would have to be trained all over again to carry out the same task on my machine; it is uncertain whether it could learn it successfully even after thousands of demonstrations (because of all the clutter), and even if it did, it would still have to learn it all over again if I moved the coffee machine, or moved the tins, or the toaster; let alone if you wanted it to use your coffee machine (different colour, make, size, shape, etc.) in your kitchen (different chaotic environment) (no offense meant).
Take the other video of the "real world task". That's the robot shuffling across a flat, clean surface and picking up an empty crate to put on an empty conveyor belt. That's just not a real-world task.
Those are tech demos and you should not put much faith in them. That kind of thing takes an insane amount of work to set up just for one video, you rarely see the outtakes and it very, very rarely generalises to real-world utility.
by YeGoblynQueenne