Akshay Parkhi's Weblog


GR00T N1.6 Architecture and Parameter Distribution

18th February 2026

1. Perception and Reasoning (The Eyes and High-Level Brain)

GR00T uses a massive “backbone” to understand its surroundings. It combines SigLIP 2 (for vision) and Qwen 3 (for language). While the eyes are frozen to keep perception stable, the reasoning layers are partially trainable to help the robot learn specific tasks.

{
"Vision_System": "SigLIP 2 (Frozen)",
"Reasoning_Engine": "Qwen 3 (Last 4 layers trainable)",
"Input_Processing": "Visual patches mapped to 2048-dim reasoning space",
"Logic": "Translates human commands like 'pick up the cup' into a plan."
}
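To make the "Input_Processing" line concrete, here is a minimal NumPy sketch of projecting vision-patch features into a 2048-dim reasoning space. The patch width (1152) and the linear projection are illustrative assumptions, not GR00T's actual connector.

```python
import numpy as np

# Assumed dims: SigLIP-style patch features mapped into the
# reasoning model's 2048-dim space (per the spec above).
PATCH_DIM = 1152      # illustrative SigLIP 2 feature width
REASONING_DIM = 2048

rng = np.random.default_rng(0)
W = rng.standard_normal((PATCH_DIM, REASONING_DIM)).astype(np.float32) * 0.02
b = np.zeros(REASONING_DIM, dtype=np.float32)

def project_patches(patches: np.ndarray) -> np.ndarray:
    """Map (num_patches, PATCH_DIM) vision features into the
    2048-dim space the reasoning layers consume."""
    return patches @ W + b

patches = rng.standard_normal((256, PATCH_DIM)).astype(np.float32)
tokens = project_patches(patches)
print(tokens.shape)  # (256, 2048)
```

The frozen/trainable split from the spec would sit on top of this: the vision weights stay fixed, while only the last four reasoning layers receive gradients.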

2. The Universal Remote (32-Robot Support)

One of the “secret” internal features revealed in the logs is the CategorySpecificMLP, which lets a single GR00T model act as a “universal remote” for up to 32 different types of robots. Whether it’s a humanoid or a robotic arm, the model simply switches to the correct internal “slot.”

{
"Module": "CategorySpecificMLP",
"Embodiment_Limit": 32,
"Unified_Action_Space": 128,
"Internal_Scaling": "Separate weights for 32 different robot kinematics.",
"Purpose": "Allows the same brain to control diverse hardware shapes."
}
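A minimal sketch of the "slot switching" idea, assuming the CategorySpecificMLP simply indexes a separate weight tensor per embodiment. The state width (64) and the single-layer internals are hypothetical; only the 32-slot limit and the 128-dim unified action space come from the spec above.

```python
import numpy as np

NUM_EMBODIMENTS = 32      # Embodiment_Limit from the spec
STATE_DIM = 64            # assumed per-robot state width (illustrative)
UNIFIED_ACTION_DIM = 128  # Unified_Action_Space from the spec

rng = np.random.default_rng(0)
# One independent weight slot per robot type
weights = rng.standard_normal(
    (NUM_EMBODIMENTS, STATE_DIM, UNIFIED_ACTION_DIM)
).astype(np.float32) * 0.02

def category_specific_forward(state: np.ndarray, embodiment_id: int) -> np.ndarray:
    """Route the input through the weight slot for this robot type."""
    return np.tanh(state @ weights[embodiment_id])

state = rng.standard_normal((1, STATE_DIM)).astype(np.float32)
out_humanoid = category_specific_forward(state, embodiment_id=0)
out_arm = category_specific_forward(state, embodiment_id=5)
print(out_humanoid.shape)  # (1, 128)
```

The same shared "brain" upstream sees a consistent 128-dim action space, while the per-slot weights absorb the kinematic differences between robot bodies.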

3. Muscle Memory (The Diffusion Movement Engine)

To move smoothly, GR00T uses a 32-layer AlternateVLDiT. Instead of snapping to a position, it “denoises” a movement—starting with a rough idea and refining it into a smooth path. It checks the camera and the instructions every two layers to make sure the movement is still correct.

{
"Action_Head": "AlternateVLDiT (Diffusion Transformer)",
"Layers": 32,
"Conditioning": "Interleaved Text and Image features every 2 blocks",
"Precision": "Cast to float32 to prevent 'shaky' robot movements",
"Internal_Dim": 1536
}
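The interleaving pattern ("conditioning every 2 blocks") can be sketched as a loop over 32 blocks that re-injects the vision/language features on every even layer. The block internals here (a tanh linear with a residual) are stand-ins, not the real AlternateVLDiT; only the layer count, interleave period, and 1536 internal dim come from the spec.

```python
import numpy as np

NUM_LAYERS = 32   # Layers from the spec
DIM = 1536        # Internal_Dim from the spec

rng = np.random.default_rng(0)
block_W = [
    rng.standard_normal((DIM, DIM)).astype(np.float32) / np.sqrt(DIM)
    for _ in range(NUM_LAYERS)
]

def denoise_step(noisy_actions: np.ndarray, vl_features: np.ndarray) -> np.ndarray:
    """One refinement pass: stand-in DiT blocks with VL conditioning
    injected every 2 blocks."""
    x = noisy_actions
    for i in range(NUM_LAYERS):
        if i % 2 == 0:
            # Re-check the camera + instruction features
            x = x + vl_features
        x = x + np.tanh(x @ block_W[i])  # stand-in for a transformer block
    return x

actions = rng.standard_normal((8, DIM)).astype(np.float32)  # noisy action tokens
cond = rng.standard_normal((8, DIM)).astype(np.float32)     # text+image features
refined = denoise_step(actions, cond)
print(refined.shape)  # (8, 1536)
```

A real diffusion head would run this refinement over several noise levels; the point here is only the every-2-blocks conditioning rhythm.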

Summary

Component | Purpose
SigLIP 2 | The “Eyes” that identify objects and the environment.
Qwen 3 (Top 4) | The “Logic” that figures out the goal of the instruction.
State Encoder | Encodes the robot’s current position (proprioception).
Action Encoder | Prepares noisy actions for the diffusion cleaning process.
AlternateVLDiT | The core “Muscle Memory”: a 1B+ parameter diffusion transformer that generates movement.
Action Decoder | The final translator that tells the robot’s motors how much to spin.
Position Embedding | A 1024-step map that ensures actions happen in the right order.

Note: To keep the robot from getting “confused” during complex movements, the training system upgrades the core movement parameters to float32 precision, while the rest of the brain stays in the lighter bfloat16 format.
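The precision split in the note can be illustrated with a tiny parameter dict: upgrade only the action-head (movement) parameters to float32 and leave the backbone in the lighter format. NumPy has no bfloat16, so float16 stands in for it here, and the parameter names are hypothetical.

```python
import numpy as np

low, high = np.float16, np.float32  # float16 stands in for bfloat16

# Hypothetical parameter names for illustration
model = {
    "backbone.qwen3.layer_31.weight": np.zeros(10, dtype=low),
    "action_head.dit.block_0.weight": np.zeros(10, dtype=low),
}

# Upgrade only the core movement parameters to full precision
for name in model:
    if name.startswith("action_head."):
        model[name] = model[name].astype(high)

print({k: v.dtype.name for k, v in model.items()})
```

Keeping the action head in float32 avoids rounding error accumulating across the denoising steps, which is where small numerical noise would show up as physical jitter.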

