NVIDIA Robotics just released SONIC (Supersizing Motion Tracking for Natural Humanoid Whole-Body Control), a foundation model for humanoid control trained on 100M+ frames of motion data (700 hours) using 9,000 GPU hours.
Instead of manually engineering a reward for each skill, SONIC learns human motion priors through dense supervision from diverse motion-capture data.
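To make the contrast concrete, here is a minimal sketch of a dense per-timestep tracking reward of the kind used in motion imitation. The specific terms, weights, and function name are assumptions for illustration, not SONIC's actual objective:

```python
import numpy as np

def tracking_reward(q_ref, q, v_ref, v, w_pos=5.0, w_vel=0.1):
    # Dense imitation signal: reward the policy at every timestep for
    # matching the reference mocap pose and velocity, instead of a
    # hand-shaped per-skill reward. Weights are illustrative guesses.
    pos_err = float(np.sum((q_ref - q) ** 2))
    vel_err = float(np.sum((v_ref - v) ** 2))
    return 0.7 * np.exp(-w_pos * pos_err) + 0.3 * np.exp(-w_vel * vel_err)
```

Because the supervision comes from the data itself, the same objective scales across thousands of clips with no per-skill tuning.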
They scaled along three axes: network size (1.2M to 42M parameters), dataset volume (100M+ frames), and compute (9k GPU hours). Performance improves steadily with all three.
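For readers who want to see how such a trend is typically summarized, here is a small sketch of a log-log power-law fit. The data points are made up purely to illustrate the method:

```python
import numpy as np

# Made-up (parameter count, tracking error) pairs, only to show the
# kind of log-log fit used to read off scaling trends.
n = np.array([1.2e6, 5e6, 12e6, 42e6])
err = np.array([0.21, 0.16, 0.13, 0.10])

# Fit error ~ a * N^slope in log space; a negative slope means
# performance improves as the model grows.
slope, intercept = np.polyfit(np.log(n), np.log(err), 1)
print(f"error ~ {np.exp(intercept):.3g} * N^({slope:.2f})")
```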
VR teleoperation uses just three tracking points (head and hands) for the upper body, with a kinematic planner driving the lower body. Video teleoperation tracks kung fu and crawling motions in real time. Text-to-motion accepts commands like "walk backward" or "move like a monkey."
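As a sketch of what the sparse VR input could look like, assuming a flat observation layout and hypothetical names (neither is confirmed by the release), the three tracked poses plus the planner's base command get packed into a single policy input:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class VRTargets:
    head: np.ndarray   # (7,) headset position + quaternion
    left: np.ndarray   # (7,) left-hand controller pose
    right: np.ndarray  # (7,) right-hand controller pose

def build_policy_input(t: VRTargets, base_cmd: np.ndarray) -> np.ndarray:
    # Three tracked points steer the upper body; base_cmd is the
    # kinematic planner's lower-body velocity command. The layout is
    # a hypothetical illustration, not SONIC's actual interface.
    return np.concatenate([t.head, t.left, t.right, base_cmd])
```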
They connected it to NVIDIA's GR00T N1.5 vision-language-action (VLA) foundation model, achieving a 95% success rate on mobile manipulation tasks. This pairs high-level reasoning with fast, reactive whole-body control.
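The usual way to pair a slow reasoning model with a fast controller is to run them at different rates. The sketch below shows that pattern under assumed interfaces; the object names and methods (plan, act, apply, camera, state) are placeholders, not GR00T's or SONIC's actual APIs:

```python
import time

def run(vla, controller, robot, vla_hz=2, ctrl_hz=50, steps=10_000):
    # The high-level VLA replans a few times per second, while the
    # whole-body controller tracks its latest command at control rate.
    command = None
    per_replan = ctrl_hz // vla_hz
    for step in range(steps):
        if step % per_replan == 0:
            command = vla.plan(robot.camera(), robot.state())  # slow reasoning
        action = controller.act(robot.state(), command)        # fast reactive control
        robot.apply(action)
        time.sleep(1.0 / ctrl_hz)
```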
The scaling properties are what matter. Unlike prior humanoid controllers (modest size, limited behaviors, days of training), SONIC shows that bigger models + more data + more compute = better generalist control. Learned representations generalize to unseen motions.
The recipe: train one large model on diverse motion data, then interface it with multiple input modalities through a universal token space, as in the sketch below.
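A minimal sketch of that pattern, assuming hypothetical class and encoder names and a 256-dim token size (none of which are confirmed by the release): one policy sits behind several modality-specific front-ends that all emit tokens in the same space.

```python
import numpy as np

class TokenInterface:
    # Single control policy behind multiple input front-ends: each
    # modality-specific encoder maps raw input into a shared token
    # space. Names and shapes are illustrative assumptions.

    def __init__(self, policy, dim=256):
        self.policy = policy   # motion model consuming (T, dim) tokens
        self.encoders = {}     # modality name -> encoder callable
        self.dim = dim

    def register(self, modality, encoder):
        self.encoders[modality] = encoder

    def act(self, modality, raw_input, robot_state):
        tokens = self.encoders[modality](raw_input)  # e.g. VR poses -> tokens
        return self.policy(tokens, robot_state)
```

The design payoff is that adding a new input (video keypoints, text commands) only means writing a new encoder; the trained motion model stays untouched.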