GR00T Architecture: A Systems Engineering Breakdown
19th February 2026
GR00T is not just a VLM. It is a Perception → Reasoning → Control generation stack.
GR00T =
Vision (what the robot sees)
+ Language (what the robot understands)
+ State (what the robot feels)
+ Action (what the robot does)
+ Diffusion (how it generates smooth motion)
1. Vision Encoder (ViT)
Typically a SigLIP 2-style ViT.
Purpose: Converts camera images into visual embeddings.
Image (pixels)
→ patchify
→ transformer
→ visual tokens (e.g., 1152-dim vectors)
Why Needed: Robot must understand object location, pose, scene geometry, human motion, and grasp targets. Without this, the robot is blind.
Why Frozen: Already pretrained on billions of images. Visual features are general-purpose, expensive to retrain, and retraining risks catastrophic forgetting.
tune_vision_tower = False
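As a toy illustration of the pixels → patchify → visual tokens flow, here is a NumPy sketch. The patch size, token count, and random projection are purely illustrative; a real encoder like SigLIP 2 uses a learned convolutional patch embedding followed by transformer layers.

```python
import numpy as np

def patchify(image, patch=16):
    """Split an HxWxC image into flattened non-overlapping patches."""
    h, w, c = image.shape
    rows, cols = h // patch, w // patch
    patches = image[:rows * patch, :cols * patch].reshape(
        rows, patch, cols, patch, c).transpose(0, 2, 1, 3, 4)
    return patches.reshape(rows * cols, patch * patch * c)

rng = np.random.default_rng(0)
img = rng.random((224, 224, 3))
tokens = patchify(img)                      # (196, 768) raw patch vectors
W_embed = rng.standard_normal((768, 1152)) * 0.02  # stand-in for learned embedding
visual_tokens = tokens @ W_embed            # (196, 1152) visual tokens
```

A 224×224 image with 16×16 patches yields 14×14 = 196 tokens, each projected to the encoder's hidden width.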
2. Language Model Backbone (LLM)
Often a Qwen-like transformer.
Purpose: Provides multimodal reasoning, task planning, instruction following, and scene understanding by combining visual and text tokens for joint reasoning.
Why Needed: Robot must understand instructions like “pick the red cup,” infer intent, break tasks into sub-steps, and maintain context. This is the brain.
Why Frozen: Massive pretrained knowledge, expensive to fine-tune. GR00T does not need to rewrite language reasoning — only the control side needs adaptation.
tune_llm = False
3. MLP1 — Vision to Language Projector
Purpose: Translates visual features into LLM token space.
Vision output: 1152-dim
MLP1 → 2048-dim
Now matches LLM hidden size
Why Needed: The vision encoder and LLM operate in different embedding spaces. Without the projector they cannot communicate. Think of it as a translator between two models.
Why Frozen: It was pretrained during multimodal alignment. Retraining would break vision-language alignment.
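The projector itself is small. A minimal sketch, assuming a two-layer MLP with the dimensions quoted above (the actual layer count and activation are implementation details not stated here; ReLU stands in for GELU):

```python
import numpy as np

def mlp1(visual_tokens, W1, b1, W2, b2):
    """Project 1152-dim vision features into the 2048-dim LLM token space."""
    h = np.maximum(visual_tokens @ W1 + b1, 0.0)  # ReLU stand-in for GELU
    return h @ W2 + b2

rng = np.random.default_rng(0)
vis = rng.random((196, 1152))                      # visual tokens from the ViT
W1, b1 = rng.standard_normal((1152, 2048)) * 0.02, np.zeros(2048)
W2, b2 = rng.standard_normal((2048, 2048)) * 0.02, np.zeros(2048)
llm_tokens = mlp1(vis, W1, b1, W2, b2)             # (196, 2048): LLM-ready tokens
```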
4. State Encoder (Trainable)
Purpose: Encodes the robot’s internal state — joint angles, velocities, IMU, proprioception, and gripper force — into an embedding.
state vector → embedding
Why Needed: Vision alone is insufficient. The robot must know where its arms currently are, whether it is balanced, and whether the gripper is closed. This is proprioception.
Why Trainable: Every robot has different kinematics, sensors, and joint ordering. This component must adapt to the specific robot morphology.
tune_projector = True
5. Action Encoder (Trainable)
Purpose: Encodes previous actions taken. Motion is sequential — the robot must know what it just did to maintain temporal coherence. Without it, motion becomes unstable.
Why Trainable: Action space differs across robot types: 6-DOF arms, 29-DOF humanoids, wheeled robots. The encoder must adapt to each.
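Both encoders map robot-specific vectors into the model's token space. A minimal NumPy sketch, assuming a single linear + tanh layer and a hypothetical 37-dim state / 29-DOF action space (real encoders are deeper and the widths depend on the robot):

```python
import numpy as np

def encode(x, W, b):
    """Single linear + tanh layer mapping a robot-specific vector to d_model."""
    return np.tanh(x @ W + b)

rng = np.random.default_rng(0)
D = 2048                            # hypothetical model width
state = rng.random(37)              # e.g. joint angles + velocities + gripper
prev_actions = rng.random((4, 29))  # last 4 actions of a 29-DOF humanoid

Ws, bs = rng.standard_normal((37, D)) * 0.02, np.zeros(D)
Wa, ba = rng.standard_normal((29, D)) * 0.02, np.zeros(D)
state_emb = encode(state, Ws, bs)          # (2048,): proprioception token
action_emb = encode(prev_actions, Wa, ba)  # (4, 2048): one token per past action
```

Because `Ws` and `Wa` are shaped by the robot's state and action dimensions, these are exactly the weights that must be retrained per morphology.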
6. Action Decoder (Trainable)
Purpose: Converts the model’s latent representation into motor commands.
latent → joint torque / velocity / position targets
Why Needed: The LLM outputs abstract reasoning. The robot needs concrete outputs: 12 joint torques, 29 joint angles, velocity trajectories. The decoder maps reasoning to motor space.
Why Trainable: Every robot has different actuators, control frequencies, and motor dynamics. This component is inherently robot-specific.
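The latent → motor-command mapping can be sketched as a linear head whose output is squashed into the robot's joint limits (the tanh squashing and 16-step horizon are assumptions for illustration, not the stated GR00T design):

```python
import numpy as np

def decode_actions(latents, W, b, low, high):
    """Map latent tokens to joint position targets within joint limits."""
    raw = latents @ W + b                        # (horizon, dof)
    return low + (np.tanh(raw) + 1.0) / 2.0 * (high - low)

rng = np.random.default_rng(0)
horizon, dof, D = 16, 29, 2048
latents = rng.standard_normal((horizon, D))
W, b = rng.standard_normal((D, dof)) * 0.02, np.zeros(dof)
low, high = -np.pi * np.ones(dof), np.pi * np.ones(dof)
targets = decode_actions(latents, W, b, low, high)  # (16, 29) joint targets
```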
7. Diffusion Model (AlternateVLDiT, 32 layers)
Purpose: Instead of predicting a single action, GR00T predicts a trajectory distribution — generating smooth, physically plausible action sequences.
Why Diffusion: Robotics motion is continuous. Single-step prediction is unstable and lacks temporal smoothness. Diffusion works as follows:
Noise → denoise → trajectory
This produces multi-step action sequences, smooth control, and robust recovery behavior.
Why Trainable: Motion dynamics depend on the robot, and the data distribution depends on the environment. Fine-tuning on robot demonstrations is required.
tune_diffusion_model = True
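The noise → denoise → trajectory loop can be sketched with a toy Euler denoiser. The stand-in "model" below simply pulls the noisy sequence toward a smooth target; the real AlternateVLDiT predicts the update from vision, language, state, and action tokens, which is omitted here:

```python
import numpy as np

def denoise_trajectory(model, steps=10, horizon=16, dof=29, seed=0):
    """Toy iterative denoiser: start from noise, step toward a trajectory."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((horizon, dof))   # pure-noise action sequence
    for t in np.linspace(1.0, 0.0, steps, endpoint=False):
        x = x + (1.0 / steps) * model(x, t)   # Euler step along predicted update
    return x

# Stand-in "model": pull every joint toward a smooth sine trajectory.
target = np.sin(np.linspace(0, np.pi, 16))[:, None] * np.ones((1, 29))
velocity = lambda x, t: (target - x) / max(t, 1e-3)

traj = denoise_trajectory(velocity)  # (16, 29) smooth multi-step action sequence
```

The key property is that the whole 16-step sequence is generated jointly, which is what gives the smoothness single-step prediction lacks.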
8. Position Embedding (Trainable)
Purpose: Encodes time position in sequence for visual tokens, state sequences, and action sequences. Without it, the transformer does not know order.
Why Trainable: Robotics sequences differ from text. Control horizon length varies and the embedding must adapt to rollout length.
add_pos_embed = True
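Since the embedding is trainable, it can be pictured as a learned per-step lookup table added to the token sequence (table size and model width below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
max_horizon, D = 64, 2048
pos_table = rng.standard_normal((max_horizon, D)) * 0.02  # learned in practice

def add_pos_embed(tokens):
    """Add a learned per-step embedding so the transformer sees order."""
    T = tokens.shape[0]
    return tokens + pos_table[:T]

action_tokens = rng.standard_normal((16, D))
ordered = add_pos_embed(action_tokens)  # (16, 2048), now order-aware
```

Because the table is indexed by step, the same weights handle any rollout shorter than `max_horizon`.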
Full Architecture Overview
Camera → Vision Encoder (Frozen)
↓
MLP1 (Frozen)
↓
Text → LLM Backbone (Frozen)
↓
────────────────────────────────
Robot State → State Encoder (Trainable)
Previous Actions → Action Encoder (Trainable)
────────────────────────────────
↓
Diffusion Model (Trainable)
↓
Action Decoder (Trainable)
↓
Motor Commands
Why This Design Works
NVIDIA separated the architecture into three distinct zones:
| Part | Role | Adaptation |
|---|---|---|
| Vision | General perception | Frozen |
| Language | General reasoning | Frozen |
| Control | Robot-specific | Trainable |
This enables few-shot robot adaptation, morphology transfer, faster training, and lower GPU cost.
Practical Impact for Unitree / MuJoCo Training
When training on a Unitree G1, you do NOT retrain vision or language. You only train the state encoder, action encoder and decoder, and the diffusion model. This dramatically reduces compute requirements.
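Collecting the flags from the sections above into one place, a fine-tuning run looks roughly like this (a hypothetical config dict mirroring this article's breakdown, not the literal GR00T API):

```python
# Hypothetical fine-tuning flags mirroring the frozen/trainable split above.
finetune_config = {
    "tune_vision_tower": False,    # frozen: general perception
    "tune_llm": False,             # frozen: general reasoning
    "tune_projector": True,        # trainable: state/action encoders & decoder
    "tune_diffusion_model": True,  # trainable: motion generation
    "add_pos_embed": True,         # trainable: sequence positions
}

trainable = [name for name, on in finetune_config.items() if on]
```

Only three of the five switches are on, which is why the compute cost stays low: gradients never flow through the vision tower or the LLM.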
Final Mental Model
Foundation Brain (Frozen)
+
Robot Body Adapter (Trainable)
=
GR00T
The brain stays stable. The body adapter learns.