18 posts tagged “physical-ai”
2026
OpenUSD: Advanced Patterns and Common Gotchas
Deeper OpenUSD concepts — schemas, rendering rules, performance patterns, and the gotchas that catch people off guard.
[... 1,122 words]
OpenUSD Mastery: From Composition to Pipeline — A SO-101 Arm Journey
OpenUSD (Universal Scene Description) is not just a 3D modeling format — it’s a universal language for describing complex scenes, their relationships, and their properties. Think of it as JSON for 3D worlds, but far more powerful.
[... 1,260 words]
Learning OpenUSD — From Curious Questions to Real Understanding
Written as I explored OpenUSD before my exam. These are real questions I asked, and the answers that actually made things click for me.
[... 1,135 words]
How Everything Connects — NVIDIA’s Cosmos Pipeline from Simulation to Real-World Robots
Training robots and autonomous vehicles is fundamentally dangerous and expensive. You can’t crash 1,000 cars to teach collision avoidance, and you can’t let a robot fall off cliffs to learn edge detection. NVIDIA’s solution is an end-to-end pipeline that generates synthetic data so realistic that AI models trained on it transfer directly to the real world. Here’s how every piece connects.
[... 1,776 words]
Complete Guide: Setting Up XRoboToolkit for Robot Teleoperation with Pico 4 Ultra on WSL2
A step-by-step guide to setting up XR-based robot teleoperation using the Pico 4 Ultra headset, XRoboToolkit, and MuJoCo simulation — all running on Windows WSL2.
[... 1,293 words]
ROS 2 Humble: Complete Installation Guide with Turtlesim from Zero to First Node
This is a complete walkthrough for installing ROS 2 Humble on Ubuntu 22.04 and getting your first robot simulation running with Turtlesim. I wrote this after going through the process myself — the official docs are thorough but scattered across many pages. This puts everything in one place, from locale setup to writing your first Python node.
[... 2,166 words]
How a VLA Controls a Robot Arm: GR00T N1.5 System Architecture from Camera to Motor
I’ve been building a robot arm system that uses NVIDIA’s GR00T N1.5 — a Vision-Language-Action (VLA) model — to pick up objects from a table using only a camera, natural language instructions, and 50 demonstration episodes. After getting it working end-to-end, I wanted to write down the full system architecture for anyone trying to understand how all the pieces connect.
[... 912 words]
Collecting Training Data for VLA Robot Fine-Tuning (The Hard Way)
A Vision-Language-Action model takes camera images and a language instruction as input, and outputs robot joint actions. NVIDIA’s GR00T N1.5 is one such model — pre-trained on millions of robot demonstrations and fine-tunable for your specific robot and task. The catch: even though GR00T is pre-trained, you still need your own demonstrations to teach it your robot’s exact joint calibration, camera angles, and task environment. Without this, the model generates actions that are plausible in general but wrong for your specific setup.
[... 1,771 words]
Teaching a Humanoid Robot to Wave: Custom Motions with GEAR-SONIC
GEAR-SONIC can track arbitrary motions — not just its built-in locomotion styles. You define joint angles in a CSV, preview them with direct replay in MuJoCo, then deploy through the SONIC neural network. Here’s how the three-stage pipeline works.
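The CSV stage of that pipeline can be sketched as below. The column layout, joint names, and frame rate here are illustrative assumptions, not GEAR-SONIC’s actual schema — the idea is simply that a motion is a time-indexed table of joint angles.

```python
import csv
import math

# Hypothetical keyframe schema: a time column plus one column per joint
# angle in radians. GEAR-SONIC's real CSV layout may differ.
FIELDS = ["t", "shoulder_pitch", "shoulder_roll", "elbow"]

def wave_keyframes(duration=2.0, hz=50, swings=3):
    """Generate a simple waving motion: hold the arm up, oscillate the elbow."""
    frames = []
    for i in range(int(duration * hz)):
        t = i / hz
        frames.append({
            "t": round(t, 3),
            "shoulder_pitch": -1.2,  # arm raised
            "shoulder_roll": 0.3,
            # sinusoidal elbow swing, `swings` full cycles over the clip
            "elbow": 0.5 * math.sin(2 * math.pi * swings * t / duration),
        })
    return frames

with open("wave.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(wave_keyframes())
```

A file like this is what you would then replay directly in MuJoCo to sanity-check the motion before handing it to the tracking policy.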
[... 674 words]
VLA → WBC → MuJoCo: Two Ways to Wire Up NVIDIA’s GR00T Humanoid Stack
There are two ways to wire up NVIDIA’s GR00T stack from vision-language all the way down to physics simulation: the official NVIDIA eval pipeline and a custom pipeline using the SONIC C++ binary. I’ve set up both. Here’s how they work and where they differ.
[... 976 words]
From Vision to Torques: How NVIDIA’s GR00T Stack Controls a Humanoid Robot
NVIDIA’s GR00T stack for humanoid robots has three layers: a Vision-Language-Action model that understands what to do, a whole-body controller that figures out how to move, and a physics simulator that validates it all before touching real hardware. Here’s how they connect.
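The three-layer shape of that stack can be sketched as a toy control loop. Every class name, interface, gain, and rate below is illustrative — the real stack runs the VLA, the whole-body controller, and the simulator as separate processes at very different frequencies — but the slow-outer-loop / fast-inner-loop structure is the point.

```python
class VLA:
    """High level: image + instruction -> a task-space goal. Runs slowly."""
    def plan(self, image, instruction):
        return {"target": [0.4, 0.0, 0.3]}  # e.g. an end-effector position

class WholeBodyController:
    """Mid level: goal + current state -> actuation commands. Runs fast."""
    def step(self, goal, state):
        # Placeholder proportional controller toward the goal.
        return [0.5 * (g - s) for g, s in zip(goal["target"], state)]

class Sim:
    """Low level: physics. Integrates commands into a new state."""
    def __init__(self):
        self.state = [0.0, 0.0, 0.0]
    def apply(self, commands, dt=0.02):
        self.state = [s + c * dt for s, c in zip(self.state, commands)]
        return self.state

vla, wbc, sim = VLA(), WholeBodyController(), Sim()
goal = vla.plan(image=None, instruction="reach forward")  # slow outer loop
for _ in range(500):                                      # fast inner loop
    sim.apply(wbc.step(goal, sim.state))
print([round(s, 3) for s in sim.state])  # state has converged near the goal
```

The design choice the sketch highlights: the VLA never touches joints, and the simulator never sees language — the controller is the only component that speaks both dialects.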
[... 325 words]
GEAR-SONIC
GEAR-SONIC (Supersizing Motion Tracking for Natural Humanoid Whole-Body Control) is the big upgrade over the Decoupled WBC approach in the GR00T stack. It’s a completely different approach to humanoid control — unified whole-body, trained on human motion data rather than hand-crafted reward functions.
[... 759 words]
NVIDIA’s GR00T Whole-Body Control stack in MuJoCo
I’ve been running NVIDIA’s GR00T Whole-Body Control stack in MuJoCo — the sim-to-real bridge for humanoid robot locomotion. A MuJoCo viewer showing a simulated robot walking might look like a toy, but the neural network policy inside it is the same binary that runs on a real Unitree G1. Here’s what’s actually going on.
[... 762 words]
GR00T Architecture: A Systems Engineering Breakdown
GR00T is not just a VLM. It is a Perception → Reasoning → Control generator stack.
[... 1,200 words]
GR00T N1.6 Fine-Tuning — Full Internal Deep Dive
GR00T N1.6 is NVIDIA’s Vision-Language-Action (VLA) model for humanoid robot control. After spending time digging through the internals, here’s a comprehensive deep dive into exactly how fine-tuning works — from model architecture to gradient flow to the data pipeline.
[... 638 words]
PPO vs VLM
Modern humanoid robots combine two fundamentally different kinds of intelligence:
[... 377 words]
How GR00T Merges Vision, Chat, and Action
The biggest challenge is that vision models speak “Image-ish” (pixels) while chat models speak “Text-ish” (tokens). GR00T uses a specialized component called a Projector to act as a real-time translator.
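A projector of that kind is, at its core, a learned map from the vision encoder’s embedding space into the language model’s token-embedding space. The sketch below is a toy single-layer version with made-up dimensions (real projectors are typically small trained MLPs, and SigLIP-scale embeddings are ~1,000-dimensional, not 4):

```python
import random

random.seed(0)

# Toy projector: one linear layer mapping a vision-patch embedding
# (VISION_DIM) into the language model's embedding space (TEXT_DIM).
# Both dimensions and the random weights are purely illustrative.
VISION_DIM, TEXT_DIM = 4, 6
W = [[random.gauss(0, 0.1) for _ in range(VISION_DIM)] for _ in range(TEXT_DIM)]

def project(patch_embedding):
    """Map one image-patch embedding into the LM's embedding space."""
    return [sum(w * x for w, x in zip(row, patch_embedding)) for row in W]

# One "image" arrives as a sequence of patch embeddings; after projection
# the LM can attend over them exactly like ordinary token embeddings.
patches = [[1.0, 0.0, -1.0, 0.5], [0.2, 0.3, 0.1, -0.4]]
soft_tokens = [project(p) for p in patches]
print(len(soft_tokens), len(soft_tokens[0]))
```

The output is a sequence of “soft tokens” the same width as text embeddings — which is exactly what makes pixels and words interleavable in one transformer.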
[... 362 words]
GR00T N1.6 Architecture and Parameter Distribution
GR00T uses a massive “backbone” to understand its surroundings. It combines SigLIP 2 (for vision) and Qwen 3 (for language). While the eyes are frozen to keep perception stable, the reasoning layers are partially trainable to help the robot learn specific tasks.
[... 362 words]