From Vision to Torques: How NVIDIA’s GR00T Stack Controls a Humanoid Robot
22nd February 2026
NVIDIA’s GR00T stack for humanoid robots has three layers: a Vision-Language-Action model that understands what to do, a whole-body controller that figures out how to move, and a physics simulator that validates it all before touching real hardware. Here’s how they connect.
The Full Stack
Human says: "Go pick up that box"
↓
GR00T N1.6 — Vision-Language-Action model
(sees the scene, reads the instruction, outputs motion plan)
↓
GEAR-SONIC — Whole-Body Controller
(tracks the motion plan with all 29 joints)
↓
MuJoCo — Physics Simulation
(validates the policy before deploying to a real Unitree G1)
↓
Real robot moves
Each layer solves a different problem. The VLA handles what to do. SONIC handles how to move. MuJoCo makes sure it won’t fall over before you try it on a $50k+ robot.
Layer 1: GR00T N1.6 — The Brain
GR00T N1.6 is a Vision-Language-Action (VLA) model with three components:
┌───────────────────────────────────────────────────────┐
│ EagleBackbone (FROZEN) │
│ ├── Vision Encoder (ViT) ──▶ image features │
│ ├── Language Model (LLM) ──▶ text features │
│ └── MLP1 Projector ──▶ fused features [2048d] │
└───────────────────────┬───────────────────────────────┘
↓
┌───────────────────────────────────────────────────────┐
│ Action Head (TRAINABLE) │
│ ├── State Encoder ──▶ current robot state [1536d] │
│ ├── Action Encoder ──▶ noisy trajectory [1024d] │
│ ├── AlternateVLDiT (32 layers) ──▶ denoised actions │
│ └── Action Decoder ──▶ joint targets [16 steps] │
└───────────────────────────────────────────────────────┘
The frozen backbone (~7B parameters) processes a camera image and a text instruction like “pick the blue block” into a fused feature vector. The trainable action head takes that context plus the robot’s current joint state and uses flow matching (not standard diffusion) to predict 16 future action steps.
Key details:
- Flow matching — learns a velocity field from noise to clean actions. The model predicts velocity = actions - noise, not the actions directly
- Beta(1.5, 1.0) timestep sampling — biases training toward harder denoising tasks
- Per-embodiment weights — the state/action encoders maintain separate weight matrices for different robot types (up to 32)
- Fine-tuning is feasible on a single GPU because only the action head trains; the 7B vision-language backbone stays frozen
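The flow-matching setup above can be sketched in a few lines. This is my own illustration of the training targets, not GR00T's actual code: a linear noise-to-action path, a Beta(1.5, 1.0) timestep, and a constant velocity target of actions - noise.

```python
import numpy as np

rng = np.random.default_rng(0)

def flow_matching_targets(actions, rng):
    """Build one flow-matching training pair (illustrative sketch).

    actions: clean action chunk, shape (horizon, action_dim)
    Returns the noisy input x_t, the timestep t, and the velocity target.
    """
    noise = rng.standard_normal(actions.shape)
    # Beta(1.5, 1.0) skews t toward 1, i.e. toward the harder,
    # nearly-clean end of the denoising path.
    t = rng.beta(1.5, 1.0)
    # Linear interpolation path from noise (t=0) to clean actions (t=1).
    x_t = (1.0 - t) * noise + t * actions
    # The model is trained to predict the constant velocity along this path.
    velocity_target = actions - noise
    return x_t, t, velocity_target

actions = rng.standard_normal((16, 29))  # 16-step chunk, 29 joints
x_t, t, v = flow_matching_targets(actions, rng)
print(x_t.shape, v.shape)  # (16, 29) (16, 29)
```

At inference time the head integrates this learned velocity field from pure noise back to a clean 16-step action chunk.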
The VLA outputs high-level motion commands: walk forward, turn left, reach arm toward the object. It does not directly produce low-level joint torques — that’s SONIC’s job.
Layer 2: GEAR-SONIC — The Body
SONIC (Supersizing Motion Tracking for Natural Humanoid Whole-Body Control) takes the VLA’s motion plan and executes it with the robot’s actual joints.
| Aspect | Old Approach (Decoupled WBC) | SONIC |
|---|---|---|
| Controls | Legs only (15 joints), arms locked | All 29 joints — legs, waist, arms, wrists |
| Trained on | RL reward functions | Large-scale human motion capture data |
| Architecture | Separate lower body RL + upper body IK | Single encoder-decoder network |
| Locomotion modes | Walk and balance | 27 distinct styles from one checkpoint |
Instead of hand-designing reward functions for “good walking,” SONIC learns by watching how humans move. The encoder compresses the robot’s current state into a latent representation. The decoder maps full-body motion references into 29 joint position targets. A kinematic planner generates those references at 10 Hz, while the policy runs at 50 Hz.
This is what enables the robot to walk, run, jump, crawl, stealth-walk, and do elbow crawls — all from the same neural network checkpoint.
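The 10 Hz planner / 50 Hz policy split is a simple rate-divided loop. Here is my own minimal sketch of that scheduling (placeholder reference values, not NVIDIA's code):

```python
# The kinematic planner refreshes the motion reference at 10 Hz,
# while the policy consumes the latest reference at 50 Hz.
POLICY_HZ = 50
PLANNER_HZ = 10
STEPS = 50  # simulate one second of policy ticks

planner_calls = 0
policy_calls = 0
reference = None

for step in range(STEPS):
    # The planner runs on every 5th policy tick (50 / 10 = 5).
    if step % (POLICY_HZ // PLANNER_HZ) == 0:
        planner_calls += 1
        reference = f"reference@{step}"  # placeholder motion reference
    # The policy runs every tick, tracking the most recent reference.
    policy_calls += 1

print(planner_calls, policy_calls)  # 10 50
```

Between planner updates, the policy keeps tracking the last reference while reacting to the robot's current state at the faster rate.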
Layer 3: MuJoCo — The Test Bench
Before any of this runs on a physical Unitree G1, it runs in MuJoCo. The simulation loop executes every 0.005s:
read sensors (joint positions, velocities, IMU)
↓
build observation vector (86 dims × 6 history frames = 516)
↓
ONNX policy inference → target joint angles
↓
PD controller → joint torques
↓
MuJoCo physics engine steps the simulation
↓
repeat
MuJoCo replaces the physical robot. Everything else — the policy weights, the PD controller gains, the observation pipeline — is identical to what runs on real hardware. The .onnx file is the same binary at every stage: training in Isaac Sim, validation in MuJoCo, deployment on the robot.
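The 86 dims × 6 history frames = 516 observation in that loop is a stacked history buffer. A minimal sketch of how such a buffer could work, assuming zero-padding on the first few steps (the exact feature layout and padding scheme are my assumptions, not the repo's):

```python
from collections import deque

import numpy as np

OBS_DIM = 86   # per-frame observation size from the post
HISTORY = 6    # number of stacked frames

history = deque(maxlen=HISTORY)

def build_observation(frame, history):
    """Append the newest frame and return the flattened 516-dim vector."""
    history.append(frame)
    frames = list(history)
    # Zero-pad until the buffer fills up (the first few control steps).
    while len(frames) < HISTORY:
        frames.insert(0, np.zeros(OBS_DIM))
    return np.concatenate(frames)  # shape (516,)

for step in range(10):
    frame = np.random.default_rng(step).standard_normal(OBS_DIM)
    obs = build_observation(frame, history)

print(obs.shape)  # (516,)
```

The same buffer logic has to run identically in simulation and on hardware, since the policy weights were trained against exactly this observation layout.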
For the SONIC pipeline specifically, simulation runs as two processes communicating over DDS:
┌──────────────────────────────┐ DDS ┌──────────────────────────────┐
│ Python / MuJoCo (200 Hz) │ ◀─────▶ │ C++ / TensorRT (50 Hz) │
│ Physics + PD torque control │ │ Encoder-decoder policy │
│ Publishes robot state │ │ Kinematic planner (10 Hz) │
└──────────────────────────────┘ └──────────────────────────────┘
How the Layers Connect
In a full deployment, the three layers form a hierarchy:
| Layer | Runs At | Input | Output |
|---|---|---|---|
| GR00T N1.6 (VLA) | ~1-10 Hz | Camera image + language instruction | High-level motion commands |
| GEAR-SONIC (WBC) | 50 Hz | Motion references + robot state | 29 joint position targets |
| PD Controller | 200 Hz | Joint targets + current positions | Joint torques |
The VLA thinks slow — it sees the world, understands the task, and decides what motion to produce. SONIC reacts fast — it tracks that motion reference while keeping the robot balanced. The PD controller reacts fastest — it converts joint targets to torques at the motor level.
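The innermost loop, the PD controller, is just a position-error term plus velocity damping. A minimal sketch with illustrative gains (the repo's actual tuned gains will differ):

```python
import numpy as np

def pd_torques(q_target, q, qd, kp, kd):
    """Classic PD position controller: proportional on position error,
    derivative damping on joint velocity."""
    return kp * (q_target - q) - kd * qd

q_target = np.zeros(29)   # policy output: 29 joint position targets
q = np.full(29, 0.25)     # current joint positions
qd = np.full(29, 0.5)     # current joint velocities
kp, kd = 60.0, 2.0        # placeholder gains, not the tuned values

tau = pd_torques(q_target, q, qd, kp, kd)
print(tau[0])  # 60*(0 - 0.25) - 2*0.5 = -16.0
```

Running this at 200 Hz against targets that only change at 50 Hz is what lets the motors react between policy updates.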
Driving the robot from the keyboard (WASD for direction, number keys for gait style) is me standing in for the VLA layer. In production, GR00T N1.6 generates those commands from camera images and natural language.
My Sim2Sim Setup
I put together a repo to run the SONIC layer end-to-end in MuJoCo: github.com/avparkhi/GEAR-SONIC-Sim2Sim
GEAR-SONIC-Sim2Sim/
├── gear_sonic/ → MuJoCo simulation environment + DDS bridge
├── gear_sonic_deploy/ → C++ TensorRT inference engine + ONNX models
├── decoupled_wbc/ → Legacy controller + VR teleoperation
└── install_scripts/ → Automated setup for Ubuntu/WSL2
Two terminals to launch:
# Terminal 1: Physics simulation
python gear_sonic/scripts/run_sim_loop.py
# Terminal 2: Policy inference
bash deploy.sh sim --input-type keyboard
Startup: ] to activate, 9 to drop the robot, Enter for planner mode, number keys to select gait, WASD to move. Eight locomotion modes available — from slow walk to elbow crawl.
Requirements: Ubuntu 22.04/24.04 or WSL2, NVIDIA GPU, and specifically TensorRT 10.7.0+cuda12.6 (newer versions break).
Links
- Sim2Sim repo: github.com/avparkhi/GEAR-SONIC-Sim2Sim
- SONIC paper: arxiv.org/abs/2511.07820
- Model weights: huggingface.co/nvidia/GEAR-SONIC
- GR00T WBC docs: nvlabs.github.io/GR00T-WholeBodyControl