Akshay Parkhi's Weblog

From Vision to Torques: How NVIDIA’s GR00T Stack Controls a Humanoid Robot

22nd February 2026

NVIDIA’s GR00T stack for humanoid robots has three layers: a Vision-Language-Action model that understands what to do, a whole-body controller that figures out how to move, and a physics simulator that validates it all before touching real hardware. Here’s how they connect.

The Full Stack

Human says: "Go pick up that box"
        ↓
GR00T N1.6 — Vision-Language-Action model
(sees the scene, reads the instruction, outputs motion plan)
        ↓
GEAR-SONIC — Whole-Body Controller
(tracks the motion plan with all 29 joints)
        ↓
MuJoCo — Physics Simulation
(validates the policy before deploying to a real Unitree G1)
        ↓
Real robot moves

Each layer solves a different problem. The VLA handles what to do. SONIC handles how to move. MuJoCo makes sure it won’t fall over before you try it on a $50k+ robot.

Layer 1: GR00T N1.6 — The Brain

GR00T N1.6 is a Vision-Language-Action (VLA) model with three components:

┌───────────────────────────────────────────────────────┐
│  EagleBackbone (FROZEN)                               │
│  ├── Vision Encoder (ViT) ──▶ image features          │
│  ├── Language Model (LLM)  ──▶ text features          │
│  └── MLP1 Projector        ──▶ fused features [2048d] │
└───────────────────────┬───────────────────────────────┘
                        ↓
┌───────────────────────────────────────────────────────┐
│  Action Head (TRAINABLE)                              │
│  ├── State Encoder ──▶ current robot state [1536d]    │
│  ├── Action Encoder ──▶ noisy trajectory [1024d]      │
│  ├── AlternateVLDiT (32 layers) ──▶ denoised actions  │
│  └── Action Decoder ──▶ joint targets [16 steps]      │
└───────────────────────────────────────────────────────┘

The frozen backbone (~7B parameters) processes a camera image and a text instruction like “pick the blue block” into a fused feature vector. The trainable action head takes that context plus the robot’s current joint state and uses flow matching (not standard diffusion) to predict 16 future action steps.

Key details:

  • Flow matching — learns a velocity field from noise to clean actions. The model predicts velocity = actions - noise, not the actions directly
  • Beta(1.5, 1.0) timestep sampling — biases training toward harder denoising tasks
  • Per-embodiment weights — the state/action encoders maintain separate weight matrices for different robot types (up to 32)
  • Fine-tuning is feasible on a single GPU because only the action head trains; the 7B vision-language backbone stays frozen
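The flow-matching setup in the bullets above can be sketched in a few lines of NumPy. This is illustrative only: the tensor shapes (16 action steps, 29 joints) and the Beta(1.5, 1.0) sampler come from the post; the batch size, the stand-in "model output," and the loss are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes: batch of 4 trajectories, 16 action steps, 29 joints.
B, T, D = 4, 16, 29
actions = rng.standard_normal((B, T, D))  # clean action chunks from data
noise = rng.standard_normal((B, T, D))    # sampled noise

# Beta(1.5, 1.0) timestep sampling, biasing training toward the
# harder denoising steps, as described above.
t = rng.beta(1.5, 1.0, size=(B, 1, 1))

# Linear interpolation between noise and clean actions.
x_t = (1.0 - t) * noise + t * actions

# Flow-matching target: the velocity field, not the actions themselves.
target_velocity = actions - noise

# A model would be trained so that model(x_t, t, context) ≈ target_velocity,
# e.g. with an MSE loss. A noisy stand-in for the model output:
predicted = target_velocity + 0.1 * rng.standard_normal((B, T, D))
loss = np.mean((predicted - target_velocity) ** 2)
```

At inference, the model integrates the learned velocity field from pure noise back to a clean 16-step action chunk.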

The VLA outputs high-level motion commands: walk forward, turn left, reach arm toward the object. It does not directly produce low-level joint torques — that’s SONIC’s job.

Layer 2: GEAR-SONIC — The Body

SONIC (Supersizing Motion Tracking for Natural Humanoid Whole-Body Control) takes the VLA’s motion plan and executes it with the robot’s actual joints.

| Aspect           | Old Approach (Decoupled WBC)            | SONIC                                    |
|------------------|-----------------------------------------|------------------------------------------|
| Controls         | Legs only (15 joints), arms locked      | All 29 joints: legs, waist, arms, wrists |
| Trained on       | RL reward functions                     | Large-scale human motion capture data    |
| Architecture     | Separate lower-body RL + upper-body IK  | Single encoder-decoder network           |
| Locomotion modes | Walk and balance                        | 27 distinct styles from one checkpoint   |

Instead of hand-designing reward functions for “good walking,” SONIC learns by watching how humans move. The encoder compresses the robot’s current state into a latent representation. The decoder maps full-body motion references into 29 joint position targets. A kinematic planner generates those references at 10 Hz, while the policy runs at 50 Hz.

This is what enables the robot to walk, run, jump, crawl, stealth-walk, and do elbow crawls — all from the same neural network checkpoint.
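The two rates above (10 Hz planner, 50 Hz policy) nest as a simple decimated loop. A minimal sketch, where `plan()` and `act()` are hypothetical stand-ins for the kinematic planner and the encoder-decoder policy:

```python
POLICY_HZ = 50
PLANNER_HZ = 10
STEPS_PER_PLAN = POLICY_HZ // PLANNER_HZ  # 5 policy steps per reference

def plan(step):
    """Hypothetical kinematic planner: emit a motion reference."""
    return {"reference_id": step}

def act(reference, robot_state):
    """Hypothetical policy: map reference + state to 29 joint targets."""
    return [0.0] * 29

log = []
reference = None
robot_state = {}
for step in range(100):            # 100 policy ticks = 2 s of control
    if step % STEPS_PER_PLAN == 0:
        reference = plan(step)     # planner fires every 5th tick (10 Hz)
    targets = act(reference, robot_state)  # policy fires every tick (50 Hz)
    log.append((step, reference["reference_id"], len(targets)))
```

The policy keeps tracking the most recent reference between planner updates, which is what lets it stay reactive to disturbances at 50 Hz even though the plan only refreshes at 10 Hz.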

Layer 3: MuJoCo — The Test Bench

Before any of this runs on a physical Unitree G1, it runs in MuJoCo. The simulation loop executes every 0.005 s (200 Hz):

read sensors (joint positions, velocities, IMU)
        ↓
build observation vector (86 dims × 6 history frames = 516)
        ↓
ONNX policy inference → target joint angles
        ↓
PD controller → joint torques
        ↓
MuJoCo physics engine steps the simulation
        ↓
repeat

MuJoCo replaces the physical robot. Everything else — the policy weights, the PD controller gains, the observation pipeline — is identical to what runs on real hardware. The .onnx file is the same binary at every stage: training in Isaac Sim, validation in MuJoCo, deployment on the robot.
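The observation stacking and the PD step in the loop above can be sketched like this. The dimensions (86 per frame, 6 history frames, 29 joints) come from the post; the gains and function names are made up for illustration, not the deployed values:

```python
import numpy as np
from collections import deque

OBS_DIM = 86        # per-frame observation size (from the post)
HISTORY = 6         # stacked frames: 86 * 6 = 516 policy inputs
N_JOINTS = 29
KP, KD = 60.0, 2.0  # hypothetical PD gains

history = deque(maxlen=HISTORY)

def build_observation(frame):
    """Append the newest sensor frame and flatten the history window."""
    history.append(frame)
    # Zero-pad until the buffer holds 6 frames (e.g. right after reset).
    pad = [np.zeros(OBS_DIM)] * (HISTORY - len(history))
    return np.concatenate(pad + list(history))

def pd_torques(q_target, q, qd):
    """Classic PD law: torque = Kp * position error - Kd * velocity."""
    return KP * (q_target - q) - KD * qd

obs = build_observation(np.ones(OBS_DIM))          # feeds policy inference
tau = pd_torques(np.full(N_JOINTS, 0.5),           # policy's joint targets
                 np.zeros(N_JOINTS),               # current joint positions
                 np.zeros(N_JOINTS))               # current joint velocities
```

Because this pipeline is identical in simulation and on hardware, swapping MuJoCo for the real robot changes only where `frame`, `q`, and `qd` come from.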

For the SONIC pipeline specifically, simulation runs as two processes communicating over DDS:

┌──────────────────────────────┐   DDS   ┌──────────────────────────────┐
│  Python / MuJoCo (200 Hz)    │ ◀─────▶ │  C++ / TensorRT (50 Hz)      │
│  Physics + PD torque control │         │  Encoder-decoder policy      │
│  Publishes robot state       │         │  Kinematic planner (10 Hz)   │
└──────────────────────────────┘         └──────────────────────────────┘

How the Layers Connect

In a full deployment, the three layers form a hierarchy:

| Layer            | Runs At  | Input                               | Output                     |
|------------------|----------|-------------------------------------|----------------------------|
| GR00T N1.6 (VLA) | ~1-10 Hz | Camera image + language instruction | High-level motion commands |
| GEAR-SONIC (WBC) | 50 Hz    | Motion references + robot state     | 29 joint position targets  |
| PD Controller    | 200 Hz   | Joint targets + current positions   | Joint torques              |

The VLA thinks slow — it sees the world, understands the task, and decides what motion to produce. SONIC reacts fast — it tracks that motion reference while keeping the robot balanced. The PD controller reacts fastest — it converts joint targets to torques at the motor level.
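The three rates can be made concrete with a tick counter driven by the fastest layer. A toy sketch: the rates are from the table above, and the "work" each layer does is reduced to incrementing a counter.

```python
# One simulated second of the three-layer hierarchy, driven by the
# 200 Hz PD tick. Layer rates are from the table; everything else
# is a placeholder.
PD_HZ, WBC_HZ, VLA_HZ = 200, 50, 1
calls = {"vla": 0, "wbc": 0, "pd": 0}

for tick in range(PD_HZ):              # 200 ticks = 1 second
    if tick % (PD_HZ // VLA_HZ) == 0:  # every 200th tick
        calls["vla"] += 1              # re-plan from camera + language
    if tick % (PD_HZ // WBC_HZ) == 0:  # every 4th tick
        calls["wbc"] += 1              # new 29-joint position targets
    calls["pd"] += 1                   # torque update every tick
```

In one second the VLA decides once, SONIC emits fifty target sets, and the PD loop computes two hundred torque updates.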

When I drive the robot from the keyboard (WASD for direction, number keys for gait style), I'm acting as the VLA layer. In production, GR00T N1.6 generates those commands from camera images and natural language.

My Sim2Sim Setup

I put together a repo to run the SONIC layer end-to-end in MuJoCo: github.com/avparkhi/GEAR-SONIC-Sim2Sim

GEAR-SONIC-Sim2Sim/
├── gear_sonic/           → MuJoCo simulation environment + DDS bridge
├── gear_sonic_deploy/    → C++ TensorRT inference engine + ONNX models
├── decoupled_wbc/        → Legacy controller + VR teleoperation
└── install_scripts/      → Automated setup for Ubuntu/WSL2

Two terminals to launch:

# Terminal 1: Physics simulation
python gear_sonic/scripts/run_sim_loop.py

# Terminal 2: Policy inference
bash deploy.sh sim --input-type keyboard

Startup: ] to activate, 9 to drop the robot, Enter for planner mode, number keys to select gait, WASD to move. Eight locomotion modes available — from slow walk to elbow crawl.

Requirements: Ubuntu 22.04/24.04 or WSL2, NVIDIA GPU, and specifically TensorRT 10.7.0+cuda12.6 (newer versions break).
