GR00T Architecture: A Systems Engineering Breakdown
19th February 2026
GR00T is not just a VLM. It is a Perception → Reasoning → Control generation stack.
GR00T =
Vision (what the robot sees)
+ Language (what the robot understands)
+ State (what the robot feels)
+ Action (what the robot does)
+ Diffusion (how it generates smooth motion)
1. Vision Encoder (ViT)
Typically a SigLIP 2-style ViT.
Purpose: Converts camera images into visual embeddings.
Image (pixels)
→ patchify
→ transformer
→ visual tokens (e.g., 1152-dim vectors)
Why Needed: Robot must understand object location, pose, scene geometry, human motion, and grasp targets. Without this, the robot is blind.
Why Frozen: Already pretrained on billions of images. Visual features are general-purpose, expensive to retrain, and retraining risks catastrophic forgetting.
tune_vision_tower = False
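As a toy illustration of the pixels → patchify → visual tokens flow, here is a NumPy sketch. The patch size, token count, and random projection are purely illustrative; a real encoder like SigLIP 2 uses a learned convolutional patch embedding followed by transformer layers.

```python
import numpy as np

def patchify(image, patch=16):
    """Split an HxWxC image into flattened non-overlapping patches."""
    h, w, c = image.shape
    rows, cols = h // patch, w // patch
    patches = image[:rows * patch, :cols * patch].reshape(
        rows, patch, cols, patch, c).transpose(0, 2, 1, 3, 4)
    return patches.reshape(rows * cols, patch * patch * c)

rng = np.random.default_rng(0)
img = rng.random((224, 224, 3))
tokens = patchify(img)                      # (196, 768) raw patch vectors
W_embed = rng.standard_normal((768, 1152)) * 0.02  # stand-in for learned embedding
visual_tokens = tokens @ W_embed            # (196, 1152) visual tokens
```

A 224×224 image with 16×16 patches yields 14×14 = 196 tokens, each projected to the encoder's hidden width.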
2. Language Model Backbone (LLM)
Often a Qwen-like transformer.
Purpose: Provides multimodal reasoning, task planning, instruction following, and scene understanding by combining visual and text tokens for joint reasoning.
Why Needed: Robot must understand instructions like “pick the red cup,” infer intent, break tasks into sub-steps, and maintain context. This is the brain.
Why Frozen: Massive pretrained knowledge, expensive to fine-tune. GR00T does not need to rewrite language reasoning — only the control side needs adaptation.
tune_llm = False
3. MLP1 — Vision to Language Projector
Purpose: Translates visual features into LLM token space.
Vision output: 1152-dim
MLP1 → 2048-dim
Now matches LLM hidden size
Why Needed: The vision encoder and LLM operate in different embedding spaces. Without the projector they cannot communicate. Think of it as a translator between two models.
Why Frozen: It was pretrained during multimodal alignment. Retraining would break vision-language alignment.
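The projector itself is small. A minimal sketch, assuming a two-layer MLP with the dimensions quoted above (the actual layer count and activation are implementation details not stated here; ReLU stands in for GELU):

```python
import numpy as np

def mlp1(visual_tokens, W1, b1, W2, b2):
    """Project 1152-dim vision features into the 2048-dim LLM token space."""
    h = np.maximum(visual_tokens @ W1 + b1, 0.0)  # ReLU stand-in for GELU
    return h @ W2 + b2

rng = np.random.default_rng(0)
vis = rng.random((196, 1152))                      # visual tokens from the ViT
W1, b1 = rng.standard_normal((1152, 2048)) * 0.02, np.zeros(2048)
W2, b2 = rng.standard_normal((2048, 2048)) * 0.02, np.zeros(2048)
llm_tokens = mlp1(vis, W1, b1, W2, b2)             # (196, 2048): LLM-ready tokens
```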
4. State Encoder (Trainable)
Purpose: Encodes the robot’s internal state — joint angles, velocities, IMU, proprioception, and gripper force — into an embedding.
state vector → embedding
Why Needed: Vision alone is insufficient. The robot must know where its arms currently are, whether it is balanced, and whether the gripper is closed. This is proprioception.
Why Trainable: Every robot has different kinematics, sensors, and joint ordering. This component must adapt to the specific robot morphology.
tune_projector = True
5. Action Encoder (Trainable)
Purpose: Encodes previous actions taken. Motion is sequential — the robot must know what it just did to maintain temporal coherence. Without it, motion becomes unstable.
Why Trainable: Action space differs across robot types: 6-DOF arms, 29-DOF humanoids, wheeled robots. The encoder must adapt to each.
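Both encoders map robot-specific vectors into the model's token space. A minimal NumPy sketch, assuming a single linear + tanh layer and a hypothetical 37-dim state / 29-DOF action space (real encoders are deeper and the widths depend on the robot):

```python
import numpy as np

def encode(x, W, b):
    """Single linear + tanh layer mapping a robot-specific vector to d_model."""
    return np.tanh(x @ W + b)

rng = np.random.default_rng(0)
D = 2048                            # hypothetical model width
state = rng.random(37)              # e.g. joint angles + velocities + gripper
prev_actions = rng.random((4, 29))  # last 4 actions of a 29-DOF humanoid

Ws, bs = rng.standard_normal((37, D)) * 0.02, np.zeros(D)
Wa, ba = rng.standard_normal((29, D)) * 0.02, np.zeros(D)
state_emb = encode(state, Ws, bs)          # (2048,): proprioception token
action_emb = encode(prev_actions, Wa, ba)  # (4, 2048): one token per past action
```

Because `Ws` and `Wa` are shaped by the robot's state and action dimensions, these are exactly the weights that must be retrained per morphology.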
6. Action Decoder (Trainable)
Purpose: Converts the model’s latent representation into motor commands.
latent → joint torque / velocity / position targets
Why Needed: The LLM outputs abstract reasoning. The robot needs concrete outputs: 12 joint torques, 29 joint angles, velocity trajectories. The decoder maps reasoning to motor space.
Why Trainable: Every robot has different actuators, control frequencies, and motor dynamics. This component is inherently robot-specific.
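The latent → motor-command mapping can be sketched as a linear head whose output is squashed into the robot's joint limits (the tanh squashing and 16-step horizon are assumptions for illustration, not the stated GR00T design):

```python
import numpy as np

def decode_actions(latents, W, b, low, high):
    """Map latent tokens to joint position targets within joint limits."""
    raw = latents @ W + b                        # (horizon, dof)
    return low + (np.tanh(raw) + 1.0) / 2.0 * (high - low)

rng = np.random.default_rng(0)
horizon, dof, D = 16, 29, 2048
latents = rng.standard_normal((horizon, D))
W, b = rng.standard_normal((D, dof)) * 0.02, np.zeros(dof)
low, high = -np.pi * np.ones(dof), np.pi * np.ones(dof)
targets = decode_actions(latents, W, b, low, high)  # (16, 29) joint targets
```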
7. Diffusion Model (AlternateVLDiT, 32 layers)
Purpose: Instead of predicting a single action, GR00T predicts a trajectory distribution — generating smooth, physically plausible action sequences.
Why Diffusion: Robotics motion is continuous. Single-step prediction is unstable and lacks temporal smoothness. Diffusion works as follows:
Noise → denoise → trajectory
This produces multi-step action sequences, smooth control, and robust recovery behavior.
Why Trainable: Motion dynamics depend on the robot, and the data distribution depends on the environment. Fine-tuning on robot demonstrations is required.
tune_diffusion_model = True
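The noise → denoise → trajectory loop can be sketched with a toy Euler denoiser. The stand-in "model" below simply pulls the noisy sequence toward a smooth target; the real AlternateVLDiT predicts the update from vision, language, state, and action tokens, which is omitted here:

```python
import numpy as np

def denoise_trajectory(model, steps=10, horizon=16, dof=29, seed=0):
    """Toy iterative denoiser: start from noise, step toward a trajectory."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((horizon, dof))   # pure-noise action sequence
    for t in np.linspace(1.0, 0.0, steps, endpoint=False):
        x = x + (1.0 / steps) * model(x, t)   # Euler step along predicted update
    return x

# Stand-in "model": pull every joint toward a smooth sine trajectory.
target = np.sin(np.linspace(0, np.pi, 16))[:, None] * np.ones((1, 29))
velocity = lambda x, t: (target - x) / max(t, 1e-3)

traj = denoise_trajectory(velocity)  # (16, 29) smooth multi-step action sequence
```

The key property is that the whole 16-step sequence is generated jointly, which is what gives the smoothness single-step prediction lacks.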
8. Position Embedding (Trainable)
Purpose: Encodes time position in sequence for visual tokens, state sequences, and action sequences. Without it, the transformer does not know order.
Why Trainable: Robotics sequences differ from text. Control horizon length varies and the embedding must adapt to rollout length.
add_pos_embed = True
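Since the embedding is trainable, it can be pictured as a learned per-step lookup table added to the token sequence (table size and model width below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
max_horizon, D = 64, 2048
pos_table = rng.standard_normal((max_horizon, D)) * 0.02  # learned in practice

def add_pos_embed(tokens):
    """Add a learned per-step embedding so the transformer sees order."""
    T = tokens.shape[0]
    return tokens + pos_table[:T]

action_tokens = rng.standard_normal((16, D))
ordered = add_pos_embed(action_tokens)  # (16, 2048), now order-aware
```

Because the table is indexed by step, the same weights handle any rollout shorter than `max_horizon`.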
Full Architecture Overview
Camera → Vision Encoder (Frozen)
↓
MLP1 (Frozen)
↓
Text → LLM Backbone (Frozen)
↓
────────────────────────────────
Robot State → State Encoder (Trainable)
Previous Actions → Action Encoder (Trainable)
────────────────────────────────
↓
Diffusion Model (Trainable)
↓
Action Decoder (Trainable)
↓
Motor Commands
Why This Design Works
NVIDIA separated the architecture into three distinct zones:
| Part | Role | Adaptation |
|---|---|---|
| Vision | General perception | Frozen |
| Language | General reasoning | Frozen |
| Control | Robot-specific | Trainable |
This enables few-shot robot adaptation, morphology transfer, faster training, and lower GPU cost.
Practical Impact for Unitree / MuJoCo Training
When training on a Unitree G1, you do NOT retrain vision or language. You only train the state encoder, action encoder and decoder, and the diffusion model. This dramatically reduces compute requirements.
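Collecting the flags from the sections above into one place, a fine-tuning run looks roughly like this (a hypothetical config dict mirroring this article's breakdown, not the literal GR00T API):

```python
# Hypothetical fine-tuning flags mirroring the frozen/trainable split above.
finetune_config = {
    "tune_vision_tower": False,    # frozen: general perception
    "tune_llm": False,             # frozen: general reasoning
    "tune_projector": True,        # trainable: state/action encoders & decoder
    "tune_diffusion_model": True,  # trainable: motion generation
    "add_pos_embed": True,         # trainable: sequence positions
}

trainable = [name for name, on in finetune_config.items() if on]
```

Only three of the five switches are on, which is why the compute cost stays low: gradients never flow through the vision tower or the LLM.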
Final Mental Model
Foundation Brain (Frozen)
+
Robot Body Adapter (Trainable)
=
GR00T
The brain stays stable. The body adapter learns.