VLA → WBC → MuJoCo: Two Ways to Wire Up NVIDIA’s GR00T Humanoid Stack
22nd February 2026
There are two ways to wire up NVIDIA’s GR00T stack from vision-language all the way down to physics simulation: the official NVIDIA eval pipeline and a custom pipeline using the SONIC C++ binary. I’ve set up both. Here’s how they work and where they differ.
The Full Pipeline
Both approaches follow the same logical flow:
Camera image + "pick up the apple"
↓
GR00T N1.6 VLA (vision-language-action model)
↓
High-level actions: arm targets, hand poses, navigate commands
↓
Whole-Body Controller (converts to 29-DOF joint torques)
↓
MuJoCo physics simulation
↓
Robot moves in sim
The difference is how the whole-body controller runs — Python with ONNX Runtime, or C++ with TensorRT.
Approach 1: Official NVIDIA Eval Pipeline
This is what NVIDIA ships for evaluating VLA models against RoboCasa tasks.
┌──────────────────────┐ ZMQ (port 5555) ┌──────────────────────────┐
│ Terminal 1 │◀──────────────────▶│ Terminal 2 │
│ │ obs → actions │ │
│ VLA Server │ │ Rollout + MuJoCo │
│ GR00T N1.6 │ │ RoboCasa scene │
│ (GPU inference) │ │ (tables/apple/plate) │
│ │ │ │
│ PolicyServer │ │ WholeBodyControlWrapper │
│ run_gr00t_server.py │ │ (Python WBC) │
└──────────────────────┘ └──────────────────────────┘
The step-by-step flow:
- MuJoCo scene initializes with a RoboCasa environment (kitchen counter, objects to manipulate)
- rollout_policy.py packages the observation: video.ego_view (camera image) + state.* (43-DOF joint positions) + task description (language instruction)
- PolicyClient sends the observation to the VLA server via ZMQ
- VLA model infers actions across multiple body groups:
  - action.left_arm (7 DOF)
  - action.right_arm (7 DOF)
  - action.left_hand (7 DOF)
  - action.right_hand (7 DOF)
  - action.waist (3 DOF)
  - action.navigate_command (3 DOF)
  - action.base_height_command (1 DOF)
- WholeBodyControlWrapper (Python WBC using the Balance and Walk ONNX models) converts VLA actions into 29-DOF low-level joint torques
- MuJoCo steps the physics
The WBC here is Python-based (gr00t_wbc package) using the same GR00T-WholeBodyControl-Balance.onnx and GR00T-WholeBodyControl-Walk.onnx policies via ONNX Runtime. It is not the SONIC C++ binary.
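The observation and action dictionaries above can be sketched in plain Python. The key names follow the article; the exact state key, the use of plain lists instead of arrays, and the helper functions are illustrative assumptions, not the actual gr00t API.

```python
# Sketch of the observation/action message shapes exchanged over ZMQ in
# Approach 1. Key names follow the article; the concrete state key and the
# use of plain lists (rather than numpy arrays) are illustrative assumptions.

def pack_observation(image, joint_positions, instruction):
    """Bundle one step's inputs the way rollout_policy.py would (sketch)."""
    assert len(joint_positions) == 43  # 43-DOF robot state
    return {
        "video.ego_view": image,                   # H x W x 3 camera frame
        "state.joint_positions": joint_positions,  # hypothetical state key
        "annotation.task_description": instruction,
    }

# The VLA server replies with per-body-group action chunks:
ACTION_DOFS = {
    "action.left_arm": 7,
    "action.right_arm": 7,
    "action.left_hand": 7,
    "action.right_hand": 7,
    "action.waist": 3,
    "action.navigate_command": 3,
    "action.base_height_command": 1,
}

def validate_actions(actions):
    """Check an action dict covers every group with the right width."""
    for key, dof in ACTION_DOFS.items():
        assert len(actions[key]) == dof, f"{key} should have {dof} values"
    return sum(len(v) for v in actions.values())

obs = pack_observation([[0, 0, 0]], [0.0] * 43, "pick up the apple")
actions = {k: [0.0] * d for k, d in ACTION_DOFS.items()}
print(validate_actions(actions))  # 35 high-level command dims in total
```

Note the dimensionality mismatch this makes explicit: the VLA emits 35 high-level command dimensions, which the WBC collapses into 29 low-level joint torques.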
Approach 2: Custom VLA → SONIC C++ → MuJoCo
This is closer to how it would run on a real robot: the SONIC C++ binary handles whole-body control at 50 Hz, just as it does on the robot's onboard computer.
┌───────────────┐ ZMQ img ┌────────────────┐ ZMQ planner ┌───────────────┐
│ Terminal 1 │──(5555)──▶│ Terminal 3 │───(5556)────▶│ Terminal 2 │
│ │ │ │ │ │
│ MuJoCo sim │ DDS │ Bridge │ │ SONIC C++ │
│ gear_sonic/ │ state │ bridge.py │ │ deploy_onnx │
│ run_sim_ │◀──────────────────────────────DDS action──│ │
│ loop.py │ │ GR00T N1.6 │ │ Policy(50Hz) │
│ │ │ (GPU inference)│ │ Encoder │
│ Camera pub │ │ │ │ Planner(10Hz)│
│ DDS state │ │ │ │ TRT/ONNX │
└───────────────┘ └────────────────┘ └───────────────┘
The flow:
- MuJoCo publishes camera frames via ZMQ (port 5555) and robot state via DDS
- Bridge receives image + state, runs GR00T N1.6 VLA inference, outputs navigate/arm/hand commands
- Bridge publishes planner trajectory to SONIC via ZMQ (port 5556)
- SONIC C++ binary runs encoder-decoder at 50 Hz, publishes 29-DOF joint targets back over DDS
- MuJoCo applies PD torque control and steps physics at 200 Hz
Three terminals instead of two, but the SONIC binary is the same one that runs on real hardware.
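The three rates in this loop (10 Hz planner, 50 Hz policy, 200 Hz physics) compose by decimation: the policy runs every 4th physics step and the planner every 20th. A minimal sketch of that scheduling, with a toy single-joint PD law standing in for the sim's torque controller (the gains and dynamics here are invented for illustration, not SONIC's):

```python
# Minimal multi-rate loop sketch for Approach 2: physics at 200 Hz, a
# SONIC-style policy at 50 Hz, the VLA/planner at 10 Hz. The PD gains and
# single-joint unit-inertia dynamics are toy values, not the real controller.

PHYSICS_HZ, POLICY_HZ, PLANNER_HZ = 200, 50, 10
POLICY_EVERY = PHYSICS_HZ // POLICY_HZ    # policy runs every 4 physics steps
PLANNER_EVERY = PHYSICS_HZ // PLANNER_HZ  # planner runs every 20 physics steps

KP, KD = 40.0, 2.0  # illustrative PD gains

def pd_torque(q_target, q, qd):
    """PD law applied between 50 Hz joint-target updates."""
    return KP * (q_target - q) - KD * qd

def run(steps=200):
    q, qd, q_target, goal = 0.0, 0.0, 0.0, 1.0
    counts = {"planner": 0, "policy": 0, "physics": 0}
    dt = 1.0 / PHYSICS_HZ
    for step in range(steps):
        if step % PLANNER_EVERY == 0:   # 10 Hz: VLA/planner updates the goal
            counts["planner"] += 1
        if step % POLICY_EVERY == 0:    # 50 Hz: policy emits a joint target
            q_target = goal
            counts["policy"] += 1
        tau = pd_torque(q_target, q, qd)  # 200 Hz: PD torque + Euler step
        qd += tau * dt                    # unit inertia, toy dynamics
        q += qd * dt
        counts["physics"] += 1
    return q, counts

q, counts = run()
print(counts)  # one simulated second: {'planner': 10, 'policy': 50, 'physics': 200}
```

The same decimation pattern appears in both approaches; what changes is which process owns each tier.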
Comparing the Two Approaches
| Aspect | Approach 1 (Official) | Approach 2 (Custom) |
|---|---|---|
| WBC runtime | Python gr00t_wbc | C++ SONIC binary |
| WBC speed | ~10 Hz | 50 Hz policy decoder |
| Communication | ZMQ only | ZMQ + DDS (CycloneDDS) |
| Terminals | 2 | 3 |
| Use case | Eval only | Closer to real-robot deployment |
| Setup complexity | Simpler (pip install) | Requires TensorRT 10.7.0, C++ build |
Same Brain, Different Runtime
An important detail: Approach 1 does not run SONIC, but it loads the same ONNX model weights that SONIC uses, GR00T-WholeBodyControl-Balance.onnx and GR00T-WholeBodyControl-Walk.onnx. They run in Python via ONNX Runtime inside the WholeBodyControlWrapper, rather than through the C++ SONIC binary with TensorRT.
Same neural network, same weights, different inference engine. The Python path is simpler and sufficient for simulation evaluation. The C++ path is what you’d use when milliseconds matter on real hardware.
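To put a number on "milliseconds matter": each control rate implies a hard per-cycle budget that inference plus communication must fit inside. A quick back-of-the-envelope sketch:

```python
# Per-cycle time budgets implied by the control rates in the comparison table.
def budget_ms(hz):
    """Milliseconds available per cycle at a given control rate."""
    return 1000.0 / hz

print(budget_ms(10))   # Python WBC at ~10 Hz: 100 ms per control cycle
print(budget_ms(50))   # SONIC policy decoder at 50 Hz: 20 ms per cycle
print(budget_ms(200))  # MuJoCo physics at 200 Hz: 5 ms per step
```

At 10 Hz the Python/ONNX Runtime path has 100 ms of slack per cycle, which simulation eval tolerates easily; at 50 Hz the whole encoder-decoder pass, DDS round trip, and scheduling jitter must fit in 20 ms, which is why the hardware path uses C++ and TensorRT.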
Links
- My sim2sim repo: github.com/avparkhi/GEAR-SONIC-Sim2Sim
- SONIC paper: arxiv.org/abs/2511.07820
- Model weights: huggingface.co/nvidia/GEAR-SONIC
- GR00T WBC docs: nvlabs.github.io/GR00T-WholeBodyControl