Akshay Parkhi's Weblog


VLA → WBC → MuJoCo: Two Ways to Wire Up NVIDIA’s GR00T Humanoid Stack

22nd February 2026

There are two ways to wire up NVIDIA’s GR00T stack from vision-language all the way down to physics simulation: the official NVIDIA eval pipeline and a custom pipeline using the SONIC C++ binary. I’ve set up both. Here’s how they work and where they differ.

The Full Pipeline

Both approaches follow the same logical flow:

Camera image + "pick up the apple"
        ↓
GR00T N1.6 VLA (vision-language-action model)
        ↓
High-level actions: arm targets, hand poses, navigate commands
        ↓
Whole-Body Controller (converts to 29-DOF joint torques)
        ↓
MuJoCo physics simulation
        ↓
Robot moves in sim

The difference is how the whole-body controller runs — Python with ONNX Runtime, or C++ with TensorRT.
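Stripped of the frameworks, the shared loop is simple. A minimal sketch with stand-in functions for each stage — none of these function names are real GR00T APIs, they just show the data flow:

```python
# Stand-in pipeline loop: VLA -> WBC -> physics, with fake stages.
import numpy as np

def vla_infer(image, instruction):
    """Stand-in for GR00T N1.6: maps an observation to high-level actions."""
    return {
        "action.left_arm": np.zeros(7),
        "action.right_arm": np.zeros(7),
        "action.navigate_command": np.zeros(3),
    }

def whole_body_control(actions, joint_state):
    """Stand-in for the WBC: maps high-level actions to 29-DOF torques."""
    return np.zeros(29)

def step_physics(torques):
    """Stand-in for MuJoCo's step: advances the sim, returns joint state."""
    return np.zeros(29)

joint_state = np.zeros(29)
image = np.zeros((224, 224, 3), dtype=np.uint8)  # fake camera frame
for _ in range(3):  # a few control ticks
    actions = vla_infer(image, "pick up the apple")
    torques = whole_body_control(actions, joint_state)
    joint_state = step_physics(torques)
```

In both approaches only the middle stage changes implementation; the shapes flowing through it stay the same.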

Approach 1: Official NVIDIA Eval Pipeline

This is what NVIDIA ships for evaluating VLA models against RoboCasa tasks.

┌──────────────────────┐   ZMQ (port 5555)   ┌──────────────────────────┐
│  Terminal 1          │◀───────────────────▶│  Terminal 2              │
│                      │   obs → actions     │                          │
│  VLA Server          │                     │  Rollout + MuJoCo        │
│  GR00T N1.6          │                     │  RoboCasa scene          │
│  (GPU inference)     │                     │  (tables/apple/plate)    │
│                      │                     │                          │
│  PolicyServer        │                     │  WholeBodyControlWrapper │
│  run_gr00t_server.py │                     │  (Python WBC)            │
└──────────────────────┘                     └──────────────────────────┘

The step-by-step flow:

  1. MuJoCo scene initializes with a RoboCasa environment (kitchen counter, objects to manipulate)
  2. rollout_policy.py packages the observation: video.ego_view (camera image) + state.* (43-DOF joint positions) + task description (language instruction)
  3. Sends observation to VLA server via ZMQ PolicyClient
  4. VLA model infers actions across multiple body groups:
        action.left_arm            (7 DOF)
        action.right_arm           (7 DOF)
        action.left_hand           (7 DOF)
        action.right_hand          (7 DOF)
        action.waist               (3 DOF)
        action.navigate_command    (3 DOF)
        action.base_height_command (1 DOF)
  5. WholeBodyControlWrapper (Python WBC using the Balance and Walk ONNX models) converts VLA actions into 29-DOF low-level joint torques
  6. MuJoCo steps the physics
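The obs → actions round trip (steps 2–4) can be sketched with plain ZMQ. The real PolicyServer/PolicyClient use NVIDIA's own wire format; here I assume pickled dicts over a bare REQ/REP pair, with a fake server thread standing in for run_gr00t_server.py, just to show the shape of the data:

```python
# Hedged sketch of the rollout <-> VLA-server round trip over ZMQ.
# Wire format (pickled dicts) and state key name are assumptions.
import pickle
import threading
import numpy as np
import zmq

ctx = zmq.Context()
server = ctx.socket(zmq.REP)
port = server.bind_to_random_port("tcp://127.0.0.1")  # blog uses port 5555

def fake_policy_server(sock):
    """Stand-in for run_gr00t_server.py: replies with zeroed action groups."""
    obs = pickle.loads(sock.recv())
    sock.send(pickle.dumps({
        "action.left_arm": np.zeros(7),
        "action.right_arm": np.zeros(7),
        "action.left_hand": np.zeros(7),
        "action.right_hand": np.zeros(7),
        "action.waist": np.zeros(3),
        "action.navigate_command": np.zeros(3),
        "action.base_height_command": np.zeros(1),
    }))

threading.Thread(target=fake_policy_server, args=(server,), daemon=True).start()

# Package the observation the way rollout_policy.py does conceptually:
obs = {
    "video.ego_view": np.zeros((224, 224, 3), dtype=np.uint8),  # camera image
    "state.joint_positions": np.zeros(43),                      # 43-DOF state
    "task_description": "pick up the apple",
}
client = ctx.socket(zmq.REQ)
client.connect(f"tcp://127.0.0.1:{port}")
client.send(pickle.dumps(obs))
actions = pickle.loads(client.recv())  # the seven action.* groups
```

REQ/REP gives the lockstep request-response pattern the eval loop needs: one observation out, one action chunk back, every tick.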

The WBC here is Python-based (gr00t_wbc package) using the same GR00T-WholeBodyControl-Balance.onnx and GR00T-WholeBodyControl-Walk.onnx policies via ONNX Runtime. It is not the SONIC C++ binary.
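Conceptually, the wrapper drives the ONNX policy like this. The session below is a stand-in so the sketch runs without the model file — with onnxruntime installed you would use onnxruntime.InferenceSession("GR00T-WholeBodyControl-Balance.onnx") in its place. The tensor names and observation sizes are assumptions, not the real model signature:

```python
# Sketch of a Python WBC step via an ONNX-Runtime-style session.
import numpy as np

class FakeSession:
    """Stand-in for onnxruntime.InferenceSession, same run() shape."""
    def run(self, output_names, feeds):
        obs = feeds["obs"]
        return [np.zeros((obs.shape[0], 29), dtype=np.float32)]

class WholeBodyPolicy:
    def __init__(self, session):
        self.session = session

    def act(self, proprio, command):
        # Concatenate proprioception and the VLA's high-level command into
        # the flat observation vector the policy expects (sizes assumed).
        obs = np.concatenate([proprio, command])[None].astype(np.float32)
        (actions,) = self.session.run(["actions"], {"obs": obs})
        return actions[0]  # 29-DOF low-level output

policy = WholeBodyPolicy(FakeSession())
targets = policy.act(np.zeros(58), np.zeros(17))
```

The point of the sketch: the Python WBC is just a forward pass per tick — same weights as SONIC, no C++ anywhere.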

Approach 2: Custom VLA → SONIC C++ → MuJoCo

This is closer to how it would actually run on a real robot — the SONIC C++ binary handles whole-body control at 50 Hz, the same way it would on the robot’s onboard computer.

┌───────────────┐  ZMQ img   ┌─────────────────┐  ZMQ planner  ┌───────────────┐
│  Terminal 1   │──(5555)───▶│  Terminal 3     │───(5556)─────▶│  Terminal 2   │
│               │            │                 │               │               │
│  MuJoCo sim   │  DDS       │  Bridge         │               │  SONIC C++    │
│  gear_sonic/  │  state     │  bridge.py      │               │  deploy_onnx  │
│  run_sim_     │◀───────────────────────────DDS action────────│               │
│  loop.py      │            │  GR00T N1.6     │               │  Policy(50Hz) │
│               │            │  (GPU inference)│               │  Encoder      │
│  Camera pub   │            │                 │               │  Planner(10Hz)│
│  DDS state    │            │                 │               │  TRT/ONNX     │
└───────────────┘            └─────────────────┘               └───────────────┘

The flow:

  1. MuJoCo publishes camera frames via ZMQ (port 5555) and robot state via DDS
  2. Bridge receives image + state, runs GR00T N1.6 VLA inference, outputs navigate/arm/hand commands
  3. Bridge publishes planner trajectory to SONIC via ZMQ (port 5556)
  4. SONIC C++ binary runs encoder-decoder at 50 Hz, publishes 29-DOF joint targets back over DDS
  5. MuJoCo applies PD torque control and steps physics at 200 Hz
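One way to picture step 3 is the bridge flattening the VLA's action groups into a single planner message before it goes out on port 5556. The field order and sizes below are assumptions for illustration, not SONIC's actual wire format:

```python
# Hedged sketch: pack VLA action groups into one flat planner vector,
# then unpack on the receiving (SONIC) side. Layout is assumed.
import numpy as np

FIELDS = [                       # (name, dof) -- assumed ordering
    ("navigate_command", 3),
    ("base_height_command", 1),
    ("waist", 3),
    ("left_arm", 7),
    ("right_arm", 7),
    ("left_hand", 7),
    ("right_hand", 7),
]

def pack_planner_msg(actions):
    """Concatenate action groups into one flat float32 vector."""
    return np.concatenate(
        [np.asarray(actions[name], dtype=np.float32) for name, _ in FIELDS]
    )

def unpack_planner_msg(msg):
    """Split the flat vector back into named groups."""
    out, i = {}, 0
    for name, dof in FIELDS:
        out[name] = msg[i : i + dof]
        i += dof
    return out

actions = {name: np.arange(dof, dtype=np.float32) for name, dof in FIELDS}
msg = pack_planner_msg(actions)      # 35 floats on the wire
restored = unpack_planner_msg(msg)
```

A flat fixed-layout vector is the natural shape here because the 10 Hz planner side and the 50 Hz policy side agree on the schema at compile time, not per message.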

Three terminals instead of two, but the SONIC binary is the same one that runs on real hardware.
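Step 5 — PD torque control tracking SONIC's joint targets — can be sketched without MuJoCo by substituting toy unit-inertia joints for the real dynamics. The gains are made up; in the real loop the torque feeds the simulated actuators and the physics steps at 200 Hz:

```python
# Toy PD tracking loop standing in for the MuJoCo side of step 5.
import numpy as np

KP, KD = 60.0, 3.0       # assumed PD gains
DT = 1.0 / 200.0         # physics timestep (200 Hz)

def pd_torque(q_target, q, qd):
    """tau = Kp * (q_target - q) - Kd * qd, per joint."""
    return KP * (q_target - q) - KD * qd

# Double-integrator joints as a stand-in for the real dynamics:
q, qd = np.zeros(29), np.zeros(29)
q_target = np.full(29, 0.5)     # targets from SONIC (refreshed at 50 Hz)
for _ in range(800):            # 4 simulated seconds
    tau = pd_torque(q_target, q, qd)
    qd += tau * DT              # unit-inertia joints
    q += qd * DT                # joints converge toward the targets
```

Note the rate structure: SONIC updates targets at 50 Hz, but the PD loop closes at the full 200 Hz physics rate, so each target is held for four physics steps.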

Comparing the Two Approaches

Aspect              Approach 1 (Official)      Approach 2 (Custom)
WBC runtime         Python gr00t_wbc           C++ SONIC binary
WBC speed           ~10 Hz                     50 Hz policy decoder
Communication       ZMQ only                   ZMQ + DDS (CycloneDDS)
Terminals           2                          3
Use case            Eval only                  Closer to real-robot deployment
Setup complexity    Simpler (pip install)      Requires TensorRT 10.7.0, C++ build

Same Brain, Different Runtime

An important detail: Approach 1 does not run SONIC, but it loads the very same ONNX weights SONIC uses — GR00T-WholeBodyControl-Balance.onnx and GR00T-WholeBodyControl-Walk.onnx. They run in Python via ONNX Runtime inside the WholeBodyControlWrapper, rather than through the C++ SONIC binary with TensorRT.

Same neural network, same weights, different inference engine. The Python path is simpler and sufficient for simulation evaluation. The C++ path is what you’d use when milliseconds matter on real hardware.

Links

This is VLA → WBC → MuJoCo: Two Ways to Wire Up NVIDIA’s GR00T Humanoid Stack by Akshay Parkhi, posted on 22nd February 2026.

Next: Teaching a Humanoid Robot to Wave: Custom Motions with GEAR-SONIC

Previous: From Vision to Torques: How NVIDIA's GR00T Stack Controls a Humanoid Robot