GR00T N1.6 Fine-Tuning — Full Internal Deep Dive
19th February 2026
GR00T N1.6 is NVIDIA’s Vision-Language-Action (VLA) model for humanoid robot control. After spending time digging through the internals, here’s a comprehensive deep dive into exactly how fine-tuning works — from model architecture to gradient flow to the data pipeline.
1. Model Architecture (3 Major Components)
GR00T N1.6 is built from three major components arranged in a pipeline:
┌─────────────────────────────────────────────────────────────┐
│ Gr00tN1d6 Model │
│ │
│ ┌──────────────────────────────────┐ │
│ │ A. EagleBackbone (FROZEN) │ │
│ │ ├── Vision Encoder (ViT) │──▶ image features │
│ │ ├── Language Model (LLM) │──▶ text features │
│ │ └── MLP1 Projector │──▶ fused features │
│ │ Output: [B, seq_len, 2048] │ │
│ └──────────────────────────────────┘ │
│ ↓ │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ B. Gr00tN1d6ActionHead (TRAINABLE) │ │
│ │ ├── State Encoder (CategorySpecificMLP) │ │
│ │ ├── Action Encoder (MultiEmbodimentActionEncoder) │ │
│ │ ├── Diffusion Model (AlternateVLDiT, 32 layers) │ │
│ │ └── Action Decoder (CategorySpecificMLP) │ │
│ └──────────────────────────────────────────────────────┘ │
│ ↓ │
│ predicted action trajectory [B, 16, action_dim]│
└─────────────────────────────────────────────────────────────┘
The EagleBackbone handles vision and language understanding — it encodes camera images and text instructions into a fused feature representation. The Action Head takes those features, combines them with the robot’s current state, and uses a diffusion transformer to predict a trajectory of 16 future actions.
2. Which Layers Are Frozen vs Trainable
This is the key design decision that makes fine-tuning feasible on a single GPU:
| Component | Frozen | Trainable | Config Flag |
|---|---|---|---|
| Vision Encoder (ViT) | YES | No | tune_vision_tower=False |
| Language Model (LLM backbone) | YES | No | tune_llm=False |
| MLP1 (vision→language projector) | YES | No | Part of vision tower |
| State Encoder | No | YES | tune_projector=True |
| Action Encoder | No | YES | tune_projector=True |
| Action Decoder | No | YES | tune_projector=True |
| Diffusion Model (AlternateVLDiT, 32 layers) | No | YES | tune_diffusion_model=True |
| Position Embedding | No | YES | add_pos_embed=True |
The entire vision-language backbone (~7B parameters) stays frozen. Only the action head (diffusion model plus the state/action encoders and decoder) is fine-tuned, and it is fine-tuned in full — no LoRA is used by default (use_backbone_lora=0, use_llm_lora=0).
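The freeze pattern in the table can be sketched with plain requires_grad toggles. This is a minimal sketch, not the actual GR00T source — the attribute names (backbone, action_head, state_encoder, etc.) are assumptions chosen to mirror the table:

```python
import torch
from torch import nn

def apply_freeze_config(model: nn.Module,
                        tune_llm=False,
                        tune_vision_tower=False,
                        tune_projector=True,
                        tune_diffusion_model=True):
    """Toggle requires_grad per component, mirroring the table above.

    Assumes the model exposes .backbone (vision tower + LLM) and
    .action_head with state/action encoders, a decoder, and the DiT
    under .model — hypothetical attribute names for illustration.
    """
    for p in model.backbone.parameters():
        p.requires_grad = tune_llm or tune_vision_tower
    head = model.action_head
    for m in (head.state_encoder, head.action_encoder, head.action_decoder):
        for p in m.parameters():
            p.requires_grad = tune_projector
    for p in head.model.parameters():  # the 32-layer AlternateVLDiT
        p.requires_grad = tune_diffusion_model
    return model
```

With the defaults above, an optimizer built over `filter(lambda p: p.requires_grad, model.parameters())` sees only the action head.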
3. Frozen Module Handling During Training
Even though frozen parameters have requires_grad=False, the model also explicitly switches the corresponding frozen modules to eval mode at the start of every forward pass:
# In Gr00tN1d6ActionHead, at each forward():
if self.training:
    if not self.tune_diffusion_model:
        self.model.eval()  # disable dropout in DiT
    if not self.tune_projector:
        self.state_encoder.eval()
        self.action_encoder.eval()
        self.action_decoder.eval()
This ensures dropout in frozen layers is disabled (and any normalization layers use their inference statistics), giving deterministic features to the trainable layers.
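The effect is easy to see with a standalone Dropout layer: in train mode its output is stochastic, while in eval mode it becomes the identity:

```python
import torch
from torch import nn

drop = nn.Dropout(p=0.5)
x = torch.ones(8)

drop.train()   # training mode: roughly half the entries are zeroed, the rest scaled by 2
_ = drop(x)    # stochastic output

drop.eval()    # eval mode: dropout is a no-op
assert torch.equal(drop(x), x)
```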
4. The Forward Pass (Training) — Step by Step
Step 4a: Vision-Language Processing (Frozen Backbone)
ego_view image (480×640)
→ Eagle any-res processor (resize, crop, tile)
→ ViT encoder → image features
"pick the blue block"
→ Tokenizer → token IDs
→ LLM backbone (truncated at layer 16) → text features
→ MLP1 projector fuses them → backbone_features [B, seq_len, 2048]
Step 4b: State Encoding (Trainable)
Robot state (43 DOF joints: left_leg, right_leg, waist, arms, hands)
→ Min/max normalization (per-embodiment statistics)
→ CategorySpecificMLP (embodiment-specific weights)
→ state_features [B, 1, 1536]
The CategorySpecificMLP maintains separate weight matrices per embodiment (up to 32). W1: [num_embodiments, input_dim, hidden_dim] — the embodiment ID selects the correct weights at runtime.
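The per-embodiment weight selection can be sketched as follows. This is a simplified single-layer version (the real module is an MLP); the W1 shape follows the text, the rest is illustrative:

```python
import torch
from torch import nn

class CategorySpecificLinear(nn.Module):
    """One weight matrix per embodiment; the embodiment ID picks it at runtime."""
    def __init__(self, num_embodiments, input_dim, hidden_dim):
        super().__init__()
        # W1: [num_embodiments, input_dim, hidden_dim], as described above
        self.W1 = nn.Parameter(torch.randn(num_embodiments, input_dim, hidden_dim) * 0.02)
        self.b1 = nn.Parameter(torch.zeros(num_embodiments, hidden_dim))

    def forward(self, x, embodiment_id):
        # x: [B, T, input_dim]; embodiment_id: [B] integer tensor
        W = self.W1[embodiment_id]        # [B, input_dim, hidden_dim]
        b = self.b1[embodiment_id]        # [B, hidden_dim]
        return torch.bmm(x, W) + b.unsqueeze(1)
```

Indexing the parameter tensor with the embodiment ID keeps all embodiments in one module while giving each its own weights.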
Step 4c: Flow Matching Noise Injection (The Core Training Mechanism)
This is the key — GR00T uses flow matching (not standard diffusion):
# 1. Sample timestep from Beta distribution
t = sample_time(batch_size) # t ~ Beta(α=1.5, β=1.0), biased toward high-noise
# 2. Sample random noise
noise = torch.randn(actions.shape) # Same shape as action trajectory
# 3. Create noisy trajectory via linear interpolation
noisy_trajectory = (1 - t) * noise + t * actions
# t≈0 → mostly noise
# t≈1 → mostly clean action
# 4. Ground truth target is the velocity
velocity = actions - noise # What the model must predict
Step 4d: Action Encoding (Trainable)
noisy_trajectory [B, 16, action_dim] + timestep t
→ MultiEmbodimentActionEncoder:
→ W1: action MLP (per-embodiment)
→ Sinusoidal timestep encoding
→ W2: swish(action_embed) * timestep_embed (fusion)
→ W3: output projection
→ action_features [B, 16, 1024]
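The W1/W2/W3 fusion above can be sketched in a simplified, embodiment-agnostic form. Dimensions follow the text; the exact layer structure and the sinusoidal embedding details are assumptions:

```python
import math
import torch
from torch import nn

def sinusoidal_embedding(t, dim):
    # t: [B] in [0, 1] -> [B, dim] sinusoidal timestep features
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half) / half)
    angles = t[:, None] * freqs[None, :]
    return torch.cat([angles.sin(), angles.cos()], dim=-1)

class ActionEncoderSketch(nn.Module):
    def __init__(self, action_dim, hidden=1024):
        super().__init__()
        self.W1 = nn.Linear(action_dim, hidden)   # action MLP
        self.W2 = nn.Linear(hidden, hidden)       # fusion projection
        self.W3 = nn.Linear(hidden, hidden)       # output projection

    def forward(self, noisy_actions, t):
        # noisy_actions: [B, 16, action_dim]; t: [B]
        a = self.W1(noisy_actions)                    # [B, 16, hidden]
        t_emb = sinusoidal_embedding(t, a.shape[-1])  # [B, hidden]
        # swish(action_embed) * timestep_embed, then project
        fused = self.W2(nn.functional.silu(a) * t_emb[:, None, :])
        return self.W3(fused)                         # [B, 16, hidden]
```

The multiplicative gating (swish of the action embedding times the timestep embedding) lets the noise level modulate every action token before the output projection.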
Step 4e: Diffusion Model — AlternateVLDiT (Trainable, 32 Layers)
This is the largest trainable component:
Input: concat(state_features, action_features) [B, 1+16, 1024]
Conditioning: backbone_features [B, seq_len, 2048]
For each of 32 layers (alternating):
├── Self-Attention: state+action tokens attend to each other
│ (conditioned on timestep)
├── Cross-Attention: attend to backbone_features (vision+language)
│ (this is how the model "sees" the image and "reads" the instruction)
└── Feed-Forward Network
Output: [B, 17, 1024]
Adaptive Layer Norm (AdaLN): Each layer’s normalization is conditioned on the diffusion timestep t, so the model knows what noise level it’s denoising from.
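A sketch of how AdaLN conditions a layer on the timestep, using the common two-parameter scale/shift form (GR00T's exact modulation may differ):

```python
import torch
from torch import nn

class AdaLayerNorm(nn.Module):
    """LayerNorm whose scale and shift are regressed from the timestep embedding."""
    def __init__(self, dim, cond_dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.to_scale_shift = nn.Linear(cond_dim, 2 * dim)

    def forward(self, x, t_emb):
        # x: [B, T, dim]; t_emb: [B, cond_dim]
        scale, shift = self.to_scale_shift(t_emb).chunk(2, dim=-1)
        return self.norm(x) * (1 + scale[:, None, :]) + shift[:, None, :]
```

Because the affine parameters come from t_emb rather than being learned constants, every token's normalization is tied to the current noise level.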
Step 4f: Action Decoding & Loss
DiT output → extract action portion → [B, 16, 1024]
→ CategorySpecificMLP (per-embodiment action decoder)
→ predicted_velocity [B, 16, action_dim]
Loss = MSE(predicted_velocity, ground_truth_velocity) * action_mask
= MSE(pred, actions - noise) * mask
The action_mask zeros out invalid action dimensions (e.g., if different embodiments have different DOF counts).
5. Loss Function Details
The loss type is MSE on flow matching velocity prediction:
velocity = actions - noise # Ground truth
pred = model_output[:, -action_horizon:] # Model prediction (last 16 tokens)
loss = F.mse_loss(pred, velocity, reduction="none") * action_mask
loss = loss.sum() / (action_mask.sum() + 1e-6)
Why velocity, not action? Flow matching learns the direction from noise to data, not the data itself. This is mathematically equivalent to learning the vector field that transports noise to clean actions.
Beta(1.5, 1.0) timestep sampling biases training toward higher noise levels (harder denoising tasks), improving sample quality.
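The velocity parameterization also makes inference simple: integrate the learned field from noise (t=0) to clean actions (t=1). A minimal Euler sketch — the integrator and step count are illustrative, not GR00T's exact sampler:

```python
import torch

def euler_integrate(velocity_fn, noise, num_steps=4):
    """Integrate dx/dt = velocity_fn(x, t) from pure noise at t=0 to t=1."""
    x = noise
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = torch.full((x.shape[0],), i * dt)
        v = velocity_fn(x, t)  # predicted velocity [B, 16, action_dim]
        x = x + v * dt         # Euler step along the flow
    return x
```

With a perfect model (v = actions - noise, a constant field), the integration recovers the clean actions exactly, which is the sense in which the MSE target above defines the transport from noise to data.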
6. Action Chunking / Action Horizon
- Action horizon = 16: the model predicts 16 future actions at once
- delta_indices=list(range(16)) in the modality config defines which future timesteps to predict
- Actions use a relative representation for arms (ActionRepresentation.RELATIVE — relative to the current state) and absolute for hands/waist/commands
- Relative action conversion: relative_action = action - current_state (done during data preprocessing)
- During inference, only the first few actions from each 16-step chunk are executed before re-predicting
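The relative conversion for the arm dimensions can be sketched like this (the dimension indices are hypothetical; which joints are relative comes from the modality config):

```python
import torch

def to_relative(action_chunk, current_state, relative_dims):
    """action_chunk: [16, D]; subtract the current state on the relative dims only."""
    out = action_chunk.clone()
    out[:, relative_dims] = out[:, relative_dims] - current_state[relative_dims]
    return out
```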
7. Data Pipeline Flow
Raw LeRobot-format dataset (parquet + h264 video)
↓
ShardedSingleStepDataset
- Sample random frame from random episode
- Decode video frame (h264 → RGB)
- Extract: image, state (43-DOF joints), action (43-DOF + commands), language
↓
Statistics Generation (cached)
- Compute min/max/mean/std per modality key
- Compute relative action statistics
↓
Gr00tN1d6Processor
- State normalization (min/max scaling to [-1, 1])
- Action normalization (same)
- Relative action conversion (for arm joints)
- Image augmentation:
• Color jitter (brightness=0.3, contrast=0.4, saturation=0.5, hue=0.08)
• Random rotation (if configured)
↓
Gr00tN1d6DataCollator (Eagle-based)
- Process images through Eagle any-res vision processor
- Tokenize language with left-padding
- Stack state/action/mask tensors
↓
Batch → Model Forward Pass
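The min/max normalization step in the processor maps each dimension into [-1, 1] using the cached per-embodiment statistics. A sketch (the eps guard is an assumption to avoid division by zero on constant dimensions):

```python
import torch

def minmax_normalize(x, stat_min, stat_max, eps=1e-8):
    """Scale each dimension into [-1, 1] using dataset min/max statistics."""
    return 2.0 * (x - stat_min) / (stat_max - stat_min + eps) - 1.0

def minmax_denormalize(x, stat_min, stat_max):
    """Inverse mapping, used when decoding predicted actions back to joint space."""
    return (x + 1.0) / 2.0 * (stat_max - stat_min) + stat_min
```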
8. Optimizer & Scheduler
| Setting | Value |
|---|---|
| Optimizer | AdamW (adamw_torch) |
| Learning rate | 1e-4 |
| Weight decay | 0.01 |
| LR Scheduler | Cosine annealing with warmup |
| Warmup | 5% of total steps (50 steps for a 1000-step run) |
| Batch size | 32 (global, single GPU) |
LR schedule for a 1000-step run:
Steps 0-50: Linear warmup 0 → 1e-4
Steps 50-1000: Cosine decay 1e-4 → ~0
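The schedule can be reproduced in a few lines (linear warmup to 1e-4 at step 50, cosine decay toward 0 by step 1000):

```python
import math

def lr_at(step, total_steps=1000, warmup_steps=50, peak_lr=1e-4):
    """Linear warmup followed by cosine annealing, as in the table above."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```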
9. What Gets Saved After Fine-Tuning
Modified (saved in checkpoint):
- AlternateVLDiT (32-layer diffusion transformer): all weights updated
- State Encoder: embodiment-specific MLP weights
- Action Encoder: timestep fusion + action projection weights
- Action Decoder: embodiment-specific output projection
- Position embeddings
- Per-embodiment normalization statistics
Untouched:
- Eagle ViT vision encoder (100% frozen)
- Eagle LLM backbone (100% frozen)
- Vision-language projector MLP1 (100% frozen)
The checkpoint will contain only the action head weights plus the frozen backbone reference path.