Prof. Ken Goldberg is President of the Robot Learning Foundation and Chair of the Berkeley AI Research (BAIR) Lab Steering Committee. He is co-founder of Ambi Robotics and Jacobi Robotics and is William S. Floyd Distinguished Chair of Engineering at UC Berkeley, where he leads research in robotics and automation: grasping, manipulation, and learning for applications in industry, homes, agriculture, and robot-assisted surgery. http://goldberg.berkeley.edu
AI is rapidly advancing the way we think, but we live in a material world. We still need to move things, make things, and maintain things. We need AI-driven robots to support an aging human population that doesn’t have enough workers. Large vision-language models based on internet-scale data can now pass the Turing Test for intelligence. In this sense, data has "solved" language and many claim that data has solved speech recognition and computer vision.
Will data also solve robotics? Rich Sutton points out in "The Bitter Lesson" [1] that data and black-box “end-to-end” models have surpassed the best-laid analytic work in AI. I accept that this trend will eventually produce general-purpose robots.
But the question is: when?
Using commonly accepted metrics for converting word and image tokens into time, the amount of internet-scale data (texts and images) used to train contemporary large vision-language models (VLMs) is on the order of 100,000 years – it would take a human that long to read or view it [2]. However, the data needed to train robots must combine video with robot motion commands: that data does not yet exist.
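To see where a number like that comes from, here is a back-of-the-envelope sketch in Python. The corpus size, tokens-per-word ratio, and reading speed below are my own illustrative assumptions, not the exact figures behind [2]:

```python
# Back-of-envelope check of the "100,000 years" figure.
# All numbers are illustrative assumptions: a ~10-trillion-token text
# corpus and an average human reading speed of ~250 words per minute.

TOKENS = 10e12            # assumed training-corpus size, in tokens
WORDS_PER_TOKEN = 0.75    # rough tokens-to-words conversion
WORDS_PER_MINUTE = 250    # assumed human reading speed

minutes = TOKENS * WORDS_PER_TOKEN / WORDS_PER_MINUTE
years = minutes / (60 * 24 * 365)   # reading around the clock, no breaks
print(f"~{years:,.0f} years of nonstop reading")
# ~57,000 years from text alone; image and video data push the
# total toward the 100,000-year order of magnitude.
```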
One way to collect robot data is teleoperation – where human “trainers” use remote control devices to painstakingly choose every motion of a robot as it performs a task – like folding a towel – over and over again. This is a variant of puppeteering, an ancient art form that requires extensive human skill and patience. Unlike puppets, however, robot joint angles can be precisely recorded, so the exact position history of each motor can be combined with videos from cameras that record the scene from different angles. The data for each “trial” or “trajectory” from start to finish includes a few minutes of video and the position history of all robot motors. Many companies are gearing up with fleets of robots and humans to collect data this way.
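To make the shape of this data concrete, here is a minimal sketch of what one teleoperated trajectory might contain. The field names and types are hypothetical, not any company's actual logging format:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Frame:
    """One timestep of a teleoperated demonstration (hypothetical schema)."""
    timestamp: float               # seconds since the start of the trial
    joint_positions: List[float]   # one angle (radians) per robot motor
    gripper_opening: float         # 0.0 = fully closed, 1.0 = fully open
    camera_images: List[bytes]     # encoded frames, one per camera view

@dataclass
class Trajectory:
    """A full start-to-finish trial, e.g. one towel fold."""
    task_label: str                                    # free-text task description
    frames: List[Frame] = field(default_factory=list)  # recorded at a fixed rate

    def duration(self) -> float:
        """Length of the trial in seconds."""
        if not self.frames:
            return 0.0
        return self.frames[-1].timestamp - self.frames[0].timestamp
```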
However, the largest such dataset reported so far is on the order of 1 year of data (it was collected in under a year by many human-robot systems working in parallel). This data has been used to train large models, and initial results are intriguing. But it suggests that at current data-collection rates, a general-purpose robot, based on a ChatGPT-sized set of robot data, will be available in... 100,000 years.
So how can we close this 100,000-year “Data Gap”?
Researchers are actively pursuing two additional methods for generating robot data: simulation and 3D analysis of internet videos.
Digital simulation today looks incredibly life-like – consider the special effects in action movies and the deepfakes generated by AI. It’s relatively easy to create life-like simulations of robot drones flying or robot dogs walking down stairs and doing backflips. Simulations can also provide videos and motor data to train large robot models, and simulation data works well for robots that fly, walk, or even do backflips. But it turns out that simulation is notoriously unreliable for robot manipulation.
This Sim2Real “gap” arises because physical manipulation involves precise and changing contacts between the edges and surfaces of objects and grippers, very small but important material deformations, and very nuanced and changing frictional forces due to microscopic surface variations.
These factors are extremely difficult to measure and to model accurately. The resulting small errors produce simulation data that looks realistic but is physically inaccurate. A submillimeter inaccuracy can make the difference between carrying a glass of water and dropping it. Robots trained on simulation data can work well in simulation but often fail when manipulating physical objects. Researchers agree that physically accurate simulation of manipulation is a Grand Challenge.
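A toy calculation illustrates how a small modeling error can flip an outcome. The numbers and the simple parallel-jaw friction model below are my own illustration, not drawn from any particular simulator:

```python
# Minimal sketch: a parallel-jaw gripper holds an object against gravity.
# The grasp holds only if the two friction contacts can supply the weight:
#     2 * mu * grip_force >= mass * g
G = 9.81
MASS = 0.5           # kg, e.g. a glass of water
MU_SIMULATED = 0.40  # friction coefficient assumed by the simulator
MU_REAL = 0.35       # actual coefficient (a slightly slicker surface)

# Grip force sized exactly for the simulated friction coefficient:
grip_force = MASS * G / (2 * MU_SIMULATED)

def holds(mu: float) -> bool:
    """True if friction at both contacts can support the object's weight."""
    return 2 * mu * grip_force >= MASS * G

print("holds in simulation:", holds(MU_SIMULATED))  # True
print("holds in reality:   ", holds(MU_REAL))       # False - the glass slips
```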
The third potential source of robot data is videos on the Internet. YouTube includes about 35,000 years of videos. Many of these videos show people manipulating objects: cooking, stacking cups, folding laundry. However, it is extremely difficult to extract precise 3D motion from 2D videos. Computer vision researchers can approximately track the motion of human hands and objects in a video, but the same issues of noise and precision make data from videos unreliable for robot learning. Accurately “lifting” a video image back into 3D to recover precise finger and object motions is a Grand Challenge for computer vision that is not expected to be solved in the foreseeable future.
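A tiny pinhole-camera example shows one reason this is hard: every 3D point along a ray through the camera center lands on the same pixel, so a single 2D view does not determine depth. The focal length and points below are arbitrary illustrative values:

```python
# Pinhole-camera illustration: all 3D points along one viewing ray
# project to the same 2D pixel, so depth is lost in a single image.
FOCAL = 500.0  # focal length in pixels, arbitrary illustrative value

def project(x: float, y: float, z: float) -> tuple:
    """Project a 3D point (camera coordinates, z > 0) to pixel coordinates."""
    return (FOCAL * x / z, FOCAL * y / z)

for scale in (1.0, 2.0, 3.0):
    # Three different 3D points at different depths along the same ray...
    print(project(0.1 * scale, 0.05 * scale, 1.0 * scale))
# ...all print the same pixel: (50.0, 25.0)
```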
There is a fourth option.
Robot data can be collected from real robots working with real objects in real environments. Industry has thousands of robots doing useful work around the clock. Today, little of this real robot production data is saved, partly because most industrial robots perform extremely repetitive tasks, like automotive welding and spot-painting, that do not vary much. Data to train large models must be diverse – think of the massive range of texts and images on the internet. General-purpose robots need a broad range of data with variations in tasks, objects, and environments.
But real general-purpose robots don’t exist yet, so we can’t collect real robot data from them.
One option is to bootstrap: start with specific tasks like driving or e-commerce package sorting, where the objects vary but the task and environment don’t vary much, and gradually expand into adjacent skills as each specific skill is mastered. Some companies are developing such robots and putting them to work.
One example is Waymo, which has robot taxis operating in several US cities. These robots have “level 4” autonomy – they rely on human operators who log in remotely to guide robot taxis when unfamiliar circumstances arise.
Another example is Ambi Robotics, which has package sorting and stacking machines operating in postal and warehouse facilities. These robots are fully autonomous – but a few times an hour they drop a package. As with Waymo, human operators help out in such cases.
Both Waymo and Ambi have created a “data flywheel”, where working robots constantly collect data that is used to improve robot performance and to enable adjacent robot skills, like highway merging for Waymo and package stacking (very different from sorting) for Ambi.
One thing that Waymo and Ambi also have in common is that they don’t rely only on “end-to-end” AI models. These companies combine advances in AI and learning from data with rigorous engineering methods like inverse kinematics, 6D motion planning, and digital signal processing.
I call this GOFE (Good Old-Fashioned Engineering). GOFE was developed long before modern AI. It rests on modularity, metrics, and step-by-step algorithms grounded in geometry and physics that can be fully understood and often guaranteed to perform reliably. GOFE includes Kalman filters, RANSAC outlier rejection, PID and MPC controllers, etc. [3]
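As one concrete example of a GOFE module, here is a textbook PID controller sketch – a handful of interpretable gains that can be tested and tuned in isolation. The gains and the toy plant model at the end are purely illustrative:

```python
class PID:
    """Textbook PID controller - a typical GOFE module: a few
    interpretable gains, testable and tunable in isolation."""

    def __init__(self, kp: float, ki: float, kd: float):
        self.kp, self.ki, self.kd = kp, ki, kd
        self._integral = 0.0
        self._prev_error = None

    def update(self, error: float, dt: float) -> float:
        """Return a control command given the current error and timestep."""
        self._integral += error * dt
        derivative = 0.0 if self._prev_error is None else (error - self._prev_error) / dt
        self._prev_error = error
        return self.kp * error + self.ki * self._integral + self.kd * derivative

# Example: drive a joint toward a 1.0-radian setpoint.
controller = PID(kp=2.0, ki=0.1, kd=0.05)
position, dt = 0.0, 0.01
for _ in range(100):
    command = controller.update(error=1.0 - position, dt=dt)
    position += command * dt   # toy first-order plant, for illustration only
print(f"position after 1 second: {position:.3f} rad")
```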
Whereas “end-to-end” AI methods are “model-free”, GOFE is model-based. GOFE segments problems into modules, so that each module can be tested, fixed, or fine-tuned independently, and replaced when a better module becomes available. Model-free methods can be combined with model-based methods to “kickstart” robots to achieve the levels of reliability required for adoption in real commercial environments, where they can then begin generating real robot data. I’ve been told such a combination is what’s behind the current success of Waymo, and I know that a combination of model-free and GOFE is behind the success of Ambi. Waymo’s robot taxis are collecting vast amounts of real data, and over the past 4 years, Ambi has collected 22 years of real robot data as they have sorted over 100 million real packages [4].
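To illustrate how the two can be combined, here is a hypothetical pipeline sketch: a learned model proposes a grasp, GOFE modules check feasibility and plan the motion, and failures escalate to a human operator whose corrections become new training data. Every function below is a placeholder stand-in, not Waymo's or Ambi's actual system:

```python
from typing import List, Optional

# Hypothetical sketch of a model-free + GOFE pipeline.  Every name below
# is a placeholder stand-in, not an API from Waymo, Ambi, or any library.

def learned_grasp_proposal(image) -> List[float]:
    """Model-free stage: a trained network would propose a grasp pose here."""
    return [0.10, 0.20, 0.30, 0.0, 0.0, 0.0]     # x, y, z, roll, pitch, yaw (stub)

def inverse_kinematics(pose: List[float]) -> Optional[List[float]]:
    """GOFE stage: an analytic IK solver would return joint angles or None."""
    return [0.0] * 6 if pose[2] > 0.0 else None  # toy reachability test (stub)

def pick(image) -> str:
    pose = learned_grasp_proposal(image)         # data-driven proposal
    joints = inverse_kinematics(pose)            # model-based feasibility check
    if joints is None:
        return "escalate to a human operator"    # the correction becomes training data
    return "execute planned, collision-checked motion"  # each module testable on its own

print(pick(image=None))
```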
As noted at the beginning, I don’t disagree with Rich Sutton – I believe that model-free AI will eventually surpass GOFE and that general-purpose robots will be common at some point in our future.
I look forward to that future and hope I get to see it.
But when will the general-purpose robots arrive? I’m not sure that the public (or investors) are willing to wait very long. For the next few years, the safest bet for closing the 100,000-year data gap is to get real robots into production by combining GOFE with model-free methods. These real robots can collect data as they perform useful work such as driving taxis and sorting packages. That high-quality data will improve their performance and enable robots to perform adjacent skills, spinning up the data flywheel until it collects enough data to enable general-purpose robots.
[1] Rich Sutton. The Bitter Lesson. 13 March 2019: https://www.cs.utexas.edu/~eunsol/courses/data/bitter_lesson.pdf
[2] Kevin Black (π, Physical Intelligence) on X. 12 Nov 2024. https://x.com/kvablack/status/1856373781603987655
[3] Justin Yu*, Tara Sadjadpour*, Abby O’Neill, Mehdi Khfifi, Lawrence Yunliang Chen, Richard Cheng, Ashwin Balakrishna, Thomas Kollar, Ken Goldberg. MANIP: Integrating Interactive Perception into Long-Horizon Robot Manipulation Systems. IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Abu Dhabi, UAE. Oct 2024. https://drive.google.com/file/d/1a3PpWDwwXQ4ZkBEyVrjIlQ7Znr9XqXGv
[4] Vishal Satish, Jeff Mahler, and Ken Goldberg. PRIME-1: Scaling Large Robot Data for Industrial Reliability. Ambi Robotics Blog. 30 Jan 2025. https://www.ambirobotics.com/blog/prime-1-scaling-large-robot-data-for-industrial-reliability/
This talk and an earlier variation were presented at the 2025 FRR-NRI Annual Meeting.