Robbyant just released LingBot-VA, a robot control system that predicts what the world will look like after it acts, then checks whether reality matches the prediction.
The approach works in three steps: predict future video frames based on current observations and language instructions, decode actions from the predicted video, then replace predictions with real observations after execution.
This creates a closed loop in which the robot continuously grounds itself in reality.
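As a rough sketch of that loop, here is what the predict-decode-execute cycle could look like in code. The interfaces below (`predict_video`, `decode_actions`, `env.step`) are illustrative names chosen for this example, not the actual LingBot-VA API:

```python
def run_closed_loop(model, env, instruction, max_steps=200):
    """Minimal sketch of a predict-act-observe loop, assuming hypothetical
    model and environment interfaces."""
    obs_history = [env.reset()]  # real observations collected so far
    for _ in range(max_steps):
        # Step 1: predict future video frames from observations + language.
        predicted_frames = model.predict_video(obs_history, instruction)
        # Step 2: decode an action chunk from the predicted video.
        actions = model.decode_actions(predicted_frames)
        # Step 3: execute, then replace the predictions with real
        # observations so the model stays grounded in what actually happened.
        for action in actions:
            real_obs = env.step(action)
            obs_history.append(real_obs)
        if env.task_done():
            break
```

The key design choice is step 3: predicted frames are never allowed to persist past execution, so prediction errors cannot compound across the episode.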
LingBot-VA excels across diverse tasks: long-horizon manipulation (make breakfast, unpack a delivery), precision control (insert tubes, pick screws), and deformable or articulated objects (fold clothes, open a drawer). Results on the RoboTwin 2.0 and LIBERO benchmarks show consistent improvements over state-of-the-art methods.
What makes this powerful is long-term memory. Traditional models see only the current observation, so they get confused when visually identical states recur at different points in a task.
For example, if a robot needs to open the right box, close it, then open the left box, a memoryless model sees the closed right box twice and gets stuck in loops. LingBot-VA remembers the full sequence and completes the task correctly.
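A toy example makes the failure mode concrete. The state strings and policy functions below are hypothetical stand-ins for this illustration, not LingBot-VA internals:

```python
def memoryless_policy(obs):
    # Maps the current frame alone to an action. "right_closed" looks
    # identical at the start and after re-closing the box, so the policy
    # keeps cycling between opening and closing the right box.
    return {"right_closed": "open_right_box",
            "right_open": "close_right_box"}[obs]

def memoried_policy(obs_history):
    # Conditions on the full observation sequence, so the second visit
    # to "right_closed" is distinguishable from the first.
    if obs_history == ["right_closed"]:
        return "open_right_box"
    if obs_history[-1] == "right_open":
        return "close_right_box"
    return "open_left_box"  # right box was already opened and closed

# The memoryless policy gives the same answer at step 1 and step 3:
print(memoryless_policy("right_closed"))  # open_right_box (correct at step 1)
print(memoryless_policy("right_closed"))  # open_right_box (wrong at step 3)

# The history-conditioned policy completes the sequence:
print(memoried_policy(["right_closed"]))                                # open_right_box
print(memoried_policy(["right_closed", "right_open"]))                  # close_right_box
print(memoried_policy(["right_closed", "right_open", "right_closed"]))  # open_left_box
```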