Akshay Parkhi's Weblog


GR00T N1.6 Architecture and Parameter Distribution

18th February 2026

1. Perception and Reasoning (The Eyes and High-Level Brain)

GR00T uses a massive “backbone” to understand its surroundings. It combines SigLIP 2 (for vision) and Qwen 3 (for language). While the eyes are frozen to keep perception stable, the reasoning layers are partially trainable to help the robot learn specific tasks.

{
"Vision_System": "SigLIP 2 (Frozen)",
"Reasoning_Engine": "Qwen 3 (Last 4 layers trainable)",
"Input_Processing": "Visual patches mapped to 2048-dim reasoning space",
"Logic": "Translates human commands like 'pick up the cup' into a plan."
}
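To make the "Input_Processing" line concrete, here is a minimal NumPy sketch of projecting vision-patch features into a 2048-dim reasoning space. The patch width (1152) and the linear projection are illustrative assumptions, not GR00T's actual connector.

```python
import numpy as np

# Assumed dims: SigLIP-style patch features mapped into the
# reasoning model's 2048-dim space (per the spec above).
PATCH_DIM = 1152      # illustrative SigLIP 2 feature width
REASONING_DIM = 2048

rng = np.random.default_rng(0)
W = rng.standard_normal((PATCH_DIM, REASONING_DIM)).astype(np.float32) * 0.02
b = np.zeros(REASONING_DIM, dtype=np.float32)

def project_patches(patches: np.ndarray) -> np.ndarray:
    """Map (num_patches, PATCH_DIM) vision features into the
    2048-dim space the reasoning layers consume."""
    return patches @ W + b

patches = rng.standard_normal((256, PATCH_DIM)).astype(np.float32)
tokens = project_patches(patches)
print(tokens.shape)  # (256, 2048)
```

The frozen/trainable split from the spec would sit on top of this: the vision weights stay fixed, while only the last four reasoning layers receive gradients.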

2. The Universal Remote (32-Robot Support)

One of the “secret” internal features revealed in the logs is the CategorySpecificMLP, which lets a single GR00T model act as a “universal remote” for up to 32 different types of robots. Whether it’s a humanoid or a robotic arm, the model simply switches to the correct internal “slot.”

{
"Module": "CategorySpecificMLP",
"Embodiment_Limit": 32,
"Unified_Action_Space": 128,
"Internal_Scaling": "Separate weights for 32 different robot kinematics.",
"Purpose": "Allows the same brain to control diverse hardware shapes."
}
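A minimal sketch of the "slot switching" idea, assuming the CategorySpecificMLP simply indexes a separate weight tensor per embodiment. The state width (64) and the single-layer internals are hypothetical; only the 32-slot limit and the 128-dim unified action space come from the spec above.

```python
import numpy as np

NUM_EMBODIMENTS = 32      # Embodiment_Limit from the spec
STATE_DIM = 64            # assumed per-robot state width (illustrative)
UNIFIED_ACTION_DIM = 128  # Unified_Action_Space from the spec

rng = np.random.default_rng(0)
# One independent weight slot per robot type
weights = rng.standard_normal(
    (NUM_EMBODIMENTS, STATE_DIM, UNIFIED_ACTION_DIM)
).astype(np.float32) * 0.02

def category_specific_forward(state: np.ndarray, embodiment_id: int) -> np.ndarray:
    """Route the input through the weight slot for this robot type."""
    return np.tanh(state @ weights[embodiment_id])

state = rng.standard_normal((1, STATE_DIM)).astype(np.float32)
out_humanoid = category_specific_forward(state, embodiment_id=0)
out_arm = category_specific_forward(state, embodiment_id=5)
print(out_humanoid.shape)  # (1, 128)
```

The same shared "brain" upstream sees a consistent 128-dim action space, while the per-slot weights absorb the kinematic differences between robot bodies.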

3. Muscle Memory (The Diffusion Movement Engine)

To move smoothly, GR00T uses a 32-layer AlternateVLDiT. Instead of snapping to a position, it “denoises” a movement—starting with a rough idea and refining it into a smooth path. It checks the camera and the instructions every two layers to make sure the movement is still correct.

{
"Action_Head": "AlternateVLDiT (Diffusion Transformer)",
"Layers": 32,
"Conditioning": "Interleaved Text and Image features every 2 blocks",
"Precision": "Cast to float32 to prevent 'shaky' robot movements",
"Internal_Dim": 1536
}
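The interleaving pattern ("conditioning every 2 blocks") can be sketched as a loop over 32 blocks that re-injects the vision/language features on every even layer. The block internals here (a tanh linear with a residual) are stand-ins, not the real AlternateVLDiT; only the layer count, interleave period, and 1536 internal dim come from the spec.

```python
import numpy as np

NUM_LAYERS = 32   # Layers from the spec
DIM = 1536        # Internal_Dim from the spec

rng = np.random.default_rng(0)
block_W = [
    rng.standard_normal((DIM, DIM)).astype(np.float32) / np.sqrt(DIM)
    for _ in range(NUM_LAYERS)
]

def denoise_step(noisy_actions: np.ndarray, vl_features: np.ndarray) -> np.ndarray:
    """One refinement pass: stand-in DiT blocks with VL conditioning
    injected every 2 blocks."""
    x = noisy_actions
    for i in range(NUM_LAYERS):
        if i % 2 == 0:
            # Re-check the camera + instruction features
            x = x + vl_features
        x = x + np.tanh(x @ block_W[i])  # stand-in for a transformer block
    return x

actions = rng.standard_normal((8, DIM)).astype(np.float32)  # noisy action tokens
cond = rng.standard_normal((8, DIM)).astype(np.float32)     # text+image features
refined = denoise_step(actions, cond)
print(refined.shape)  # (8, 1536)
```

A real diffusion head would run this refinement over several noise levels; the point here is only the every-2-blocks conditioning rhythm.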

Summary

Component | Purpose
SigLIP 2 | The “Eyes” that identify objects and the environment.
Qwen 3 (Top 4) | The “Logic” that figures out the goal of the instruction.
State Encoder | Encodes the robot’s current position (proprioception).
Action Encoder | Prepares noisy actions for the diffusion cleaning process.
AlternateVLDiT | The core “Muscle Memory”: a 1B+ parameter diffusion transformer that generates movement.
Action Decoder | The final translator that tells the robot’s motors how much to spin.
Position Embedding | A 1024-step map that ensures actions happen in the right order.

Note: To keep the robot from getting “confused” during complex movements, the training system upgrades the core movement parameters to float32 precision, while the rest of the brain stays in the lighter bfloat16 format.
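The precision split in the note can be illustrated with a tiny parameter dict: upgrade only the action-head (movement) parameters to float32 and leave the backbone in the lighter format. NumPy has no bfloat16, so float16 stands in for it here, and the parameter names are hypothetical.

```python
import numpy as np

low, high = np.float16, np.float32  # float16 stands in for bfloat16

# Hypothetical parameter names for illustration
model = {
    "backbone.qwen3.layer_31.weight": np.zeros(10, dtype=low),
    "action_head.dit.block_0.weight": np.zeros(10, dtype=low),
}

# Upgrade only the core movement parameters to full precision
for name in model:
    if name.startswith("action_head."):
        model[name] = model[name].astype(high)

print({k: v.dtype.name for k, v in model.items()})
```

Keeping the action head in float32 avoids rounding error accumulating across the denoising steps, which is where small numerical noise would show up as physical jitter.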

