Akshay Parkhi's Weblog

GR00T N1.6 Fine-Tuning — Full Internal Deep Dive

19th February 2026

GR00T N1.6 is NVIDIA’s Vision-Language-Action (VLA) model for humanoid robot control. After spending time digging through the internals, here’s a comprehensive deep dive into exactly how fine-tuning works — from model architecture to gradient flow to the data pipeline.

1. Model Architecture (3 Major Components)

GR00T N1.6 is built from three major components arranged in a pipeline:

┌─────────────────────────────────────────────────────────────┐
│                    Gr00tN1d6 Model                          │
│                                                             │
│  ┌──────────────────────────────────┐                       │
│  │  A. EagleBackbone (FROZEN)       │                       │
│  │  ├── Vision Encoder (ViT)        │──▶ image features     │
│  │  ├── Language Model (LLM)        │──▶ text features      │
│  │  └── MLP1 Projector              │──▶ fused features     │
│  │      Output: [B, seq_len, 2048]  │                       │
│  └──────────────────────────────────┘                       │
│                    ↓                                        │
│  ┌──────────────────────────────────────────────────────┐   │
│  │  B. Gr00tN1d6ActionHead (TRAINABLE)                  │   │
│  │  ├── State Encoder (CategorySpecificMLP)             │   │
│  │  ├── Action Encoder (MultiEmbodimentActionEncoder)   │   │
│  │  ├── Diffusion Model (AlternateVLDiT, 32 layers)     │   │
│  │  └── Action Decoder (CategorySpecificMLP)            │   │
│  └──────────────────────────────────────────────────────┘   │
│                    ↓                                        │
│              predicted action trajectory [B, 16, action_dim]│
└─────────────────────────────────────────────────────────────┘

The EagleBackbone handles vision and language understanding — it encodes camera images and text instructions into a fused feature representation. The Action Head takes those features, combines them with the robot’s current state, and uses a diffusion transformer to predict a trajectory of 16 future actions.

2. Which Layers Are Frozen vs Trainable

This is the key design decision that makes fine-tuning feasible on a single GPU:

Component                                   | Frozen | Trainable | Config Flag
--------------------------------------------|--------|-----------|---------------------------
Vision Encoder (ViT)                        | YES    | No        | tune_vision_tower=False
Language Model (LLM backbone)               | YES    | No        | tune_llm=False
MLP1 (vision→language projector)            | YES    | No        | Part of vision tower
State Encoder                               | No     | YES       | tune_projector=True
Action Encoder                              | No     | YES       | tune_projector=True
Action Decoder                              | No     | YES       | tune_projector=True
Diffusion Model (AlternateVLDiT, 32 layers) | No     | YES       | tune_diffusion_model=True
Position Embedding                          | No     | YES       | add_pos_embed=True

The entire vision-language backbone (~7B parameters) stays frozen. Only the action head (diffusion model plus the state/action encoders and decoder) is trained, and it is trained in full — no LoRA is used by default (use_backbone_lora=0, use_llm_lora=0).
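In practice this split boils down to toggling requires_grad per module and handing only the trainable parameters to the optimizer. A minimal sketch — the module and flag names here are simplified stand-ins for EagleBackbone / Gr00tN1d6ActionHead, not the actual GR00T API:

```python
import torch.nn as nn

class ToyVLA(nn.Module):
    """Toy stand-in for the two-part model (not the real classes)."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(8, 8)       # stands in for the frozen ~7B VLM
        self.action_head = nn.Linear(8, 8)    # stands in for the trainable action head

def apply_finetune_flags(model, tune_backbone=False, tune_action_head=True):
    """Freeze/unfreeze parameter groups, then collect what the optimizer sees."""
    for p in model.backbone.parameters():
        p.requires_grad = tune_backbone
    for p in model.action_head.parameters():
        p.requires_grad = tune_action_head
    return [p for p in model.parameters() if p.requires_grad]

model = ToyVLA()
trainable = apply_finetune_flags(model)   # only action-head params remain
```

The optimizer is then constructed from `trainable` only, so no optimizer state is ever allocated for the frozen backbone.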

3. Frozen Module Handling During Training

Even though frozen parameters have requires_grad=False, the model explicitly sets them to eval mode at the start of every forward pass:

# In Gr00tN1d6ActionHead, at each forward():
if self.training:
    if not self.tune_diffusion_model:
        self.model.eval()     # disable dropout in DiT
    if not self.tune_projector:
        self.state_encoder.eval()
        self.action_encoder.eval()
        self.action_decoder.eval()

This ensures dropout in the frozen layers is disabled (and any normalization layers use their inference statistics), so the trainable layers always receive deterministic features.

4. The Forward Pass (Training) — Step by Step

Step 4a: Vision-Language Processing (Frozen Backbone)

ego_view image (480×640)
  → Eagle any-res processor (resize, crop, tile)
  → ViT encoder → image features

"pick the blue block"
  → Tokenizer → token IDs
  → LLM backbone (truncated at layer 16) → text features

→ MLP1 projector fuses them → backbone_features [B, seq_len, 2048]

Step 4b: State Encoding (Trainable)

Robot state (43 DOF joints: left_leg, right_leg, waist, arms, hands)
  → Min/max normalization (per-embodiment statistics)
  → CategorySpecificMLP (embodiment-specific weights)
  → state_features [B, 1, 1536]

The CategorySpecificMLP maintains separate weight matrices per embodiment (up to 32). W1: [num_embodiments, input_dim, hidden_dim] — the embodiment ID selects the correct weights at runtime.
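A minimal sketch of that per-embodiment selection (the W1 shape and dimensions follow the post; the class here is a simplified single layer, and the indexing details are an assumption rather than the actual GR00T code):

```python
import torch
import torch.nn as nn

class CategorySpecificLinear(nn.Module):
    """One weight matrix per embodiment; the embodiment ID selects a slice at
    runtime. Simplified sketch — the real CategorySpecificMLP stacks several
    such layers and supports up to 32 embodiments."""
    def __init__(self, num_embodiments, in_dim, out_dim):
        super().__init__()
        self.W = nn.Parameter(torch.randn(num_embodiments, in_dim, out_dim) * 0.02)
        self.b = nn.Parameter(torch.zeros(num_embodiments, out_dim))

    def forward(self, x, embodiment_id):
        # x: [B, in_dim], embodiment_id: [B] integer tensor
        W = self.W[embodiment_id]                         # [B, in_dim, out_dim]
        b = self.b[embodiment_id]                         # [B, out_dim]
        return torch.bmm(x.unsqueeze(1), W).squeeze(1) + b

layer = CategorySpecificLinear(num_embodiments=4, in_dim=43, out_dim=1536)
x = torch.randn(2, 43)
out = layer(x, torch.tensor([0, 3]))   # two samples from two different embodiments
```

Because the selection is just an index into a stacked parameter tensor, a single batch can mix embodiments without any branching.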

Step 4c: Flow Matching Noise Injection (The Core Training Mechanism)

This is the key — GR00T uses flow matching (not standard diffusion):

# 1. Sample timestep from Beta distribution
t = sample_time(batch_size)  # t ~ Beta(α=1.5, β=1.0), biased toward high-noise

# 2. Sample random noise
noise = torch.randn_like(actions)  # same shape as the action trajectory

# 3. Create noisy trajectory via linear interpolation
#    (t is broadcast over the horizon and action dimensions)
noisy_trajectory = (1 - t) * noise + t * actions
#   t≈0 → mostly noise
#   t≈1 → mostly clean action

# 4. Ground truth target is the velocity
velocity = actions - noise  # What the model must predict
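The interpolation endpoints are easy to sanity-check with plain numbers (a pure-Python toy with a one-number "trajectory"):

```python
import random

# Scalar version of the flow-matching interpolation above.
random.seed(1)
action = 0.8                      # a "clean" action value
noise = random.gauss(0.0, 1.0)    # its paired noise sample

def noisy(t):
    return (1 - t) * noise + t * action

velocity = action - noise   # the ground-truth flow-matching target
```

At t=0 the trajectory is pure noise, at t=1 it is the clean action, and the constant slope between the two endpoints is exactly the velocity the model is trained to predict.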

Step 4d: Action Encoding (Trainable)

noisy_trajectory [B, 16, action_dim] + timestep t
  → MultiEmbodimentActionEncoder:
      → W1: action MLP (per-embodiment)
      → Sinusoidal timestep encoding
      → W2: swish(action_embed) * timestep_embed (fusion)
      → W3: output projection
  → action_features [B, 16, 1024]
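The W2 fusion step can be sketched as follows. The swish-times-timestep form and the W1/W2 naming come from the description above; the sinusoidal encoding and all shapes besides [B, 16, 1024] are simplified guesses, not the real implementation:

```python
import math
import torch
import torch.nn.functional as F

def sinusoidal_embedding(t, dim):
    """Standard sinusoidal encoding of the flow-matching timestep t in [0, 1]."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half, dtype=torch.float32) / half)
    args = t[:, None] * freqs[None, :]
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)

B, horizon, dim = 2, 16, 1024
action_embed = torch.randn(B, horizon, dim)         # output of the per-embodiment W1 MLP
t = torch.rand(B)                                   # flow-matching timesteps
t_embed = sinusoidal_embedding(t, dim)              # [B, 1024]
fused = F.silu(action_embed) * t_embed[:, None, :]  # swish(action) * timestep, broadcast
```

Multiplicative fusion like this makes every action token's features a function of the noise level, before the W3 projection hands them to the DiT.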

Step 4e: Diffusion Model — AlternateVLDiT (Trainable, 32 Layers)

This is the largest trainable component:

Input: concat(state_features, action_features) [B, 1+16, 1024]
Conditioning: backbone_features [B, seq_len, 2048]

For each of 32 layers (alternating):
  ├── Self-Attention: state+action tokens attend to each other
  │     (conditioned on timestep)
  ├── Cross-Attention: attend to backbone_features (vision+language)
  │     (this is how the model "sees" the image and "reads" the instruction)
  └── Feed-Forward Network

Output: [B, 17, 1024]

Adaptive Layer Norm (AdaLN): Each layer’s normalization is conditioned on the diffusion timestep t, so the model knows what noise level it’s denoising from.
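A compact sketch of the AdaLN idea — this is the standard DiT construction (scale and shift regressed from the timestep embedding); the exact wiring and conditioning width inside AlternateVLDiT may differ:

```python
import torch
import torch.nn as nn

class AdaLN(nn.Module):
    """Adaptive LayerNorm: per-sample scale/shift come from the timestep
    embedding instead of learned static affine parameters."""
    def __init__(self, dim, cond_dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.to_scale_shift = nn.Linear(cond_dim, 2 * dim)

    def forward(self, x, cond):
        # x: [B, tokens, dim], cond: [B, cond_dim] timestep embedding
        scale, shift = self.to_scale_shift(cond).chunk(2, dim=-1)
        return self.norm(x) * (1 + scale[:, None, :]) + shift[:, None, :]

x = torch.randn(2, 17, 1024)    # state token + 16 action tokens
cond = torch.randn(2, 256)      # timestep embedding (this width is a guess)
out = AdaLN(1024, 256)(x, cond)
```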

Step 4f: Action Decoding & Loss

DiT output → extract action portion → [B, 16, 1024]
  → CategorySpecificMLP (per-embodiment action decoder)
  → predicted_velocity [B, 16, action_dim]

Loss = MSE(predicted_velocity, ground_truth_velocity) * action_mask
     = MSE(pred, actions - noise) * mask

The action_mask zeros out invalid action dimensions (e.g., if different embodiments have different DOF counts).

5. Loss Function Details

The loss type is MSE on flow matching velocity prediction:

velocity = actions - noise                       # Ground truth
pred = model_output[:, -action_horizon:]         # Model prediction (last 16 tokens)
loss = F.mse_loss(pred, velocity, reduction="none") * action_mask
loss = loss.sum() / (action_mask.sum() + 1e-6)

Why velocity, not action? Flow matching learns the direction from noise to data, not the data itself. This is mathematically equivalent to learning the vector field that transports noise to clean actions.

Beta(1.5, 1.0) timestep sampling biases training toward higher noise levels (harder denoising tasks), improving sample quality.
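Beta(1.5, 1.0) has mean α/(α+β) = 0.6 and density proportional to √t, so draws concentrate near 1. Note that how a draw maps onto the noise level depends on the interpolation convention — with t≈0 being mostly noise as above, a high-noise bias would flip the sample (e.g. t = 1 − u), which is an assumption about the exact mapping, not something stated in the code. A quick empirical check of the distribution itself:

```python
import random

# Draw from Beta(1.5, 1.0) and confirm the mean matches alpha/(alpha+beta) = 0.6.
random.seed(0)
samples = [random.betavariate(1.5, 1.0) for _ in range(100_000)]
mean = sum(samples) / len(samples)   # close to 0.6
```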

6. Action Chunking / Action Horizon

  • Action horizon = 16: The model predicts 16 future actions at once
  • delta_indices=list(range(16)) in the modality config defines which future timesteps to predict
  • Actions use relative representation for arms (ActionRepresentation.RELATIVE — relative to current state) and absolute for hands/waist/commands
  • Relative action conversion: relative_action = action - current_state (done during data preprocessing)
  • During inference, only the first few actions from each 16-step chunk are executed before re-predicting
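The inference-time loop in the last bullet amounts to receding-horizon control. A sketch, where `predict_chunk` and `execute` are placeholders and the number of steps executed per chunk is a deployment choice rather than a fixed value:

```python
ACTION_HORIZON = 16   # the model always predicts 16 actions at once
EXECUTE_STEPS = 8     # how many to actually run before re-predicting (illustrative)

def run_episode(predict_chunk, execute, total_steps):
    """Predict a chunk, execute its prefix, repeat until total_steps are done."""
    executed = 0
    while executed < total_steps:
        chunk = predict_chunk()              # 16 predicted actions
        for action in chunk[:EXECUTE_STEPS]:
            execute(action)
            executed += 1
            if executed >= total_steps:
                break
    return executed

log = []
steps = run_episode(lambda: list(range(ACTION_HORIZON)), log.append, total_steps=20)
```

Executing only a prefix of each chunk keeps the policy reactive: later actions in the chunk are discarded in favor of a fresh prediction that has seen newer observations.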

7. Data Pipeline Flow

Raw LeRobot-format dataset (parquet + h264 video)
     ↓
ShardedSingleStepDataset
  - Sample random frame from random episode
  - Decode video frame (h264 → RGB)
  - Extract: image, state (43-DOF joints), action (43-DOF + commands), language
     ↓
Statistics Generation (cached)
  - Compute min/max/mean/std per modality key
  - Compute relative action statistics
     ↓
Gr00tN1d6Processor
  - State normalization (min/max scaling to [-1, 1])
  - Action normalization (same)
  - Relative action conversion (for arm joints)
  - Image augmentation:
    • Color jitter (brightness=0.3, contrast=0.4, saturation=0.5, hue=0.08)
    • Random rotation (if configured)
     ↓
Gr00tN1d6DataCollator (Eagle-based)
  - Process images through Eagle any-res vision processor
  - Tokenize language with left-padding
  - Stack state/action/mask tensors
     ↓
Batch → Model Forward Pass
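The min/max normalization step above is the standard scaling to [−1, 1] using per-embodiment statistics; function names here are illustrative, not the processor's real API:

```python
def normalize(x, lo, hi, eps=1e-6):
    """Scale x from [lo, hi] (per-embodiment min/max stats) into [-1, 1]."""
    return 2.0 * (x - lo) / max(hi - lo, eps) - 1.0

def denormalize(x, lo, hi):
    """Invert the scaling — used on predicted actions at inference time."""
    return (x + 1.0) / 2.0 * (hi - lo) + lo

# e.g. a joint at 0.5 rad with stats min=-1.0, max=2.0 maps to the midpoint:
n = normalize(0.5, -1.0, 2.0)   # 2 * 1.5 / 3 - 1 = 0.0
```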

8. Optimizer & Scheduler

Setting       | Value
--------------|-------------------------------------------------
Optimizer     | AdamW (adamw_torch)
Learning rate | 1e-4
Weight decay  | 0.01
LR Scheduler  | Cosine annealing with warmup
Warmup        | 5% of total steps (50 steps for a 1000-step run)
Batch size    | 32 (global, single GPU)

LR schedule for a 1000-step run:

Steps   0-50:   Linear warmup   0 → 1e-4
Steps  50-1000: Cosine decay  1e-4 → ~0
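That schedule in closed form (a sketch; Hugging Face's cosine-with-warmup scheduler computes the same shape, modulo off-by-one details at the boundaries):

```python
import math

def lr_at(step, total_steps=1000, warmup=50, peak=1e-4):
    """Linear warmup to `peak`, then cosine decay to 0."""
    if step < warmup:
        return peak * step / warmup                           # linear warmup
    progress = (step - warmup) / (total_steps - warmup)
    return peak * 0.5 * (1.0 + math.cos(math.pi * progress))  # cosine decay
```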

9. What Gets Saved After Fine-Tuning

Modified (saved in checkpoint):

  • AlternateVLDiT (32-layer diffusion transformer): all weights updated
  • State Encoder: embodiment-specific MLP weights
  • Action Encoder: timestep fusion + action projection weights
  • Action Decoder: embodiment-specific output projection
  • Position embeddings
  • Per-embodiment normalization statistics

Untouched:

  • Eagle ViT vision encoder (100% frozen)
  • Eagle LLM backbone (100% frozen)
  • Vision-language projector MLP1 (100% frozen)

The checkpoint contains only the action head weights plus a reference path to the frozen backbone.
