Why Rhoda's bet on video prediction matters

Rhoda AI raised $450 million last month and is making a specific bet: that the way they're teaching robots to move is better than how everyone else is doing it. I'm trying to understand what they actually mean, and whether it changes how I think about robotics.

The founding story is worth paying attention to. Jagdeep Singh built QuantumScape from the ground up—took it public, got it to substantial value in solid-state batteries. Before that he founded Infinera, which Nokia bought for $2.3 billion. He knows how to build deep-tech companies through hardware cycles. He also knows what it takes to move from lab to production. Eric Chan, the Chief Science Officer, came from WorldLabs working on generative models. Gordon Wetzstein from Stanford leads computational imaging. Vinod Khosla, who incubated the company, is backing it. These aren't people throwing darts at a board. Singh spent 18 months in stealth before emerging. That's not typical for a venture-backed robotics startup. It suggests they were building something, not just fundraising.

The standard approach right now goes like this. Take a vision-language model that's already learned from billions of images and text on the internet. Fine-tune it on robot demonstration videos. Add a diffusion decoder that predicts motor commands. That's what Google, NVIDIA, Physical Intelligence, and Figure AI are all building. It works, and it's become the consensus design.
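To make the contrast concrete, here's a toy schematic of that consensus pipeline. Every function below is a placeholder I made up for illustration; it shows the shape of the design (semantic encoding first, then iterative action decoding) and not any vendor's actual implementation.

```python
# Toy schematic of the consensus VLA design: a pretrained vision-language
# backbone produces an embedding, and a diffusion-style head iteratively
# refines an action from it. All names and numbers here are invented.

def vlm_encode(image, instruction):
    """Stand-in for a pretrained vision-language backbone."""
    # Toy "embedding": just bundle the observation with the instruction.
    return {"scene": image, "goal": instruction}

def diffusion_decode(embedding, steps=10):
    """Stand-in for a diffusion action head: refine a noisy action iteratively."""
    action = 0.0                      # start from "noise" (a fixed toy value)
    target = len(embedding["goal"])   # toy target derived from the instruction
    for _ in range(steps):            # each step is one pass through the network
        action = action + (target - action) / 2  # refine toward the target
    return action

def vla_policy(image, instruction):
    """Semantic understanding first, then action decoding."""
    return diffusion_decode(vlm_encode(image, instruction))
```

The point of the structure is that every action requires `steps` sequential network passes, which is exactly where the latency discussed below comes from.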

Rhoda's doing something different. They pre-train on internet video alone, with no robot data, to learn how the physical world actually moves. At runtime the model predicts what the next few frames should look like, works backwards to the actions that would produce those frames, and repeats the whole cycle every few hundred milliseconds in a tight feedback loop.
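The predict-then-invert cycle can be sketched in a few lines. This is a toy illustration with made-up stand-ins for the video model and the inverse-dynamics step; the "frames" are scalars so the loop structure stays visible, and none of this reflects Rhoda's actual system.

```python
# Toy sketch of a video-prediction control loop: imagine the near future,
# back out the action that gets there, act, repeat. All dynamics invented.

def predict_next_frames(frame, horizon=3):
    """Stand-in for a learned video model: imagine the next few frames."""
    goal = 10.0                       # toy goal state
    frames, state = [], frame
    for _ in range(horizon):
        state = state + 0.5 * (goal - state)  # imagined progress toward goal
        frames.append(state)
    return frames

def infer_action(current_frame, predicted_frame):
    """Stand-in for inverse dynamics: what action produces the predicted frame?"""
    return predicted_frame - current_frame    # toy: action is the state delta

def control_loop(initial_frame, steps=20):
    """Run predict -> invert -> act in a tight loop (10 Hz in Rhoda's claim)."""
    frame = initial_frame
    for _ in range(steps):
        imagined = predict_next_frames(frame)      # imagine the near future
        action = infer_action(frame, imagined[0])  # back out the next action
        frame = frame + action                     # "execute" and observe
        # A real system would pace this loop to hold a fixed control rate.
    return frame
```

The contrast with the VLA design is in what gets computed per cycle: a frame prediction plus an inverse-dynamics step, rather than an iterative denoising chain.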

The difference matters because it's a different way of thinking about control. The standard approach is "I understand what's happening semantically, now what should I do?" Rhoda's is "I can imagine what the world should look like next, now what actions get me there?"

The latency problem is real. Diffusion-based action decoders, which most VLA models use, need multiple passes through the network to refine each action prediction, so the control loop ends up running at maybe 1 or 2 cycles per second. For a lot of manufacturing that's fine. But when timing matters (welding where speed is critical, handling parts that shift during grasping), 1 Hz is slow.
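The arithmetic behind that 1-2 Hz figure is simple enough to sketch. The per-pass cost and step counts below are illustrative assumptions, not measured numbers from any of the systems named here.

```python
# Back-of-envelope latency budget for a diffusion action decoder:
# each action needs `denoise_steps` sequential network forward passes.

def control_rate_hz(forward_pass_ms, denoise_steps, overhead_ms=0.0):
    """Control cycles per second given per-pass cost and denoising step count."""
    cycle_ms = forward_pass_ms * denoise_steps + overhead_ms
    return 1000.0 / cycle_ms

# An assumed 50 ms forward pass with 10-20 denoising steps lands at 1-2 Hz,
# the regime described above. A single-pass predictor with comparable
# per-cycle cost would sit an order of magnitude higher.
```

Usage: `control_rate_hz(50, 10)` gives 2.0 Hz and `control_rate_hz(50, 20)` gives 1.0 Hz, while a hypothetical single 100 ms cycle (`control_rate_hz(100, 1)`) gives 10.0 Hz, which is where Rhoda's claimed number sits.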

Rhoda says their system runs at 10 Hz closed-loop. If that's true and if it actually translates to better performance when things are moving, that's meaningful. If the computational cost of generating video frames every hundred milliseconds eats up the advantage, or if it needs so much computing power that it's impractical to run on the robot itself, then the architecture doesn't matter.

Both approaches face a real embodiment problem, though. Hundreds of millions of YouTube videos can teach a model physics, but a human hand works nothing like a robot gripper, and a human eye has different optics and a different refresh rate from a robot camera. You can't learn robot kinematics from watching people do things. Both Rhoda and the VLA companies claim they can fine-tune on as little as 10 hours of robot-specific data, which is consistent with what I'm reading. But what happens on different hardware? On different types of tasks? At scale? I don't know yet.

The business model Rhoda's pursuing, building its own hardware while licensing the intelligence layer to other manufacturers, makes sense. Boston Dynamics, Figure, the Chinese makers, and dozens of smaller startups will all build different robots. If Rhoda can actually make its model work across that range of hardware, that's a defensible strategy.

But NVIDIA and Physical Intelligence are pursuing the exact same thing. GR00T is explicitly designed for hardware-agnostic deployment. π0 has shown it works across different robot types. The licensing moat only holds if Rhoda's implementation is actually superior, not just different.

What I actually don't know. First: can they run 10 Hz closed-loop control on the robot itself without a datacenter's worth of GPUs? Video generation is expensive. If frame prediction has to run on cloud servers, the latency advantage disappears. That's the claim that needs real data.

Second: does video prediction actually generalise better across different hardware and tasks, or does it introduce different failure modes? A VLA model struggles when things move fast but handles semantic variation well. Rhoda's might be the opposite. I won't know until there's production data.

Third: what does this actually look like in real manufacturing? They claim robots are already operating in production. That's the only claim worth paying attention to. Not simulation benchmarks. Not 10-hour fine-tuning on lab hardware. Real factories with real variability and time pressure and actual failure modes.

The reason I'm tracking this is that the robotics foundation model space isn't settled. There are multiple viable architectural approaches, all funded at scale, all claiming deployments. Over the next 12-18 months, the one that delivers on real manufacturing tasks at reasonable cost will shape what gets built next.

More practically, if you're advising on robotics investment or vendor selection, you need to understand why Rhoda made this architectural choice and what the trade-offs actually are. You can't evaluate vendors honestly if you don't understand the different paths they're taking and what those choices imply for your specific work.

Rhoda made the fork visible by being explicit about it. The work now is tracking whether the bet actually works, and asking every vendor hard questions about what they're optimising for.


References

Bloomberg. "AI Robotics Startup Rhoda Valued at $1.7B in New Funding." 10 March 2026, bloomberg.com/news/articles/2026-03-10/ai-robotics-startup-rhoda-valued-at-1-7-billion-in-new-funding.

The Robot Report. "Rhoda AI exits stealth with $450M to train robots from video." 12 March 2026, therobotreport.com/rhoda-ai-exits-stealth-with-450m-to-train-robots-from-video/.

ASSEMBLY Magazine. "New Robotic AI Platform Targets High-Variability Manufacturing Tasks." 10 March 2026, assemblymag.com/articles/99908-new-robotic-ai-platform-targets-high-variability-manufacturing-tasks/.

Kawaharazuka, Kento, et al. "Vision-Language-Action Models for Robotics: A Review Towards Real-World Applications." IEEE Access, vol. 13, 2025, pp. 162467–162504.

Li, X., et al. "What matters in building vision–language–action models for generalist robots." Nature Machine Intelligence, vol. 8, 2026, pp. 158–172.

Disclaimer: This blog reflects my personal views only. Content does not represent the views of my employer, Info-Tech Research Group.