18 posts tagged “physical-ai”
2026
OpenUSD: Advanced Patterns and Common Gotchas
Deeper OpenUSD concepts — schemas, rendering rules, performance patterns, and the gotchas that catch people off guard.
[... 1,122 words]
OpenUSD Mastery: From Composition to Pipeline — A SO-101 Arm Journey
OpenUSD (Universal Scene Description) is not just a 3D modeling format — it’s a universal language for describing complex scenes, their relationships, and their properties. Think of it as JSON for 3D worlds, but far more powerful.
[... 1,260 words]
Learning OpenUSD — From Curious Questions to Real Understanding
Written as I explored OpenUSD before my exam. These are real questions I asked, and the answers that actually made things click for me.
[... 1,135 words]
How Everything Connects — NVIDIA’s Cosmos Pipeline from Simulation to Real-World Robots
Training robots and autonomous vehicles is fundamentally dangerous and expensive. You can’t crash 1,000 cars to teach collision avoidance, and you can’t let a robot fall off cliffs to learn edge detection. NVIDIA’s solution is an end-to-end pipeline that generates synthetic data so realistic that AI models trained on it transfer directly to the real world. Here’s how every piece connects.
[... 1,776 words]
Complete Guide: Setting Up XRoboToolkit for Robot Teleoperation with Pico 4 Ultra on WSL2
A step-by-step guide to setting up XR-based robot teleoperation using the Pico 4 Ultra headset, XRoboToolkit, and MuJoCo simulation — all running on Windows WSL2.
[... 1,293 words]
ROS 2 Humble: Complete Installation Guide with Turtlesim from Zero to First Node
This is a complete walkthrough for installing ROS 2 Humble on Ubuntu 22.04 and getting your first robot simulation running with Turtlesim. I wrote this after going through the process myself — the official docs are thorough but scattered across many pages. This puts everything in one place, from locale setup to writing your first Python node.
[... 2,166 words]
How a VLA Controls a Robot Arm: GR00T N1.5 System Architecture from Camera to Motor
I’ve been building a robot arm system that uses NVIDIA’s GR00T N1.5 — a Vision-Language-Action (VLA) model — to pick up objects from a table using only a camera, natural language instructions, and 50 demonstration episodes. After getting it working end-to-end, I wanted to write down the full system architecture for anyone trying to understand how all the pieces connect.
[... 912 words]
Collecting Training Data for VLA Robot Fine-Tuning (The Hard Way)
A Vision-Language-Action model takes camera images and a language instruction as input, and outputs robot joint actions. NVIDIA’s GR00T N1.5 is one such model — pre-trained on millions of robot demonstrations and fine-tunable for your specific robot and task. The catch: even though GR00T is pre-trained, you still need your own demonstrations to teach it your robot’s exact joint calibration, camera angles, and task environment. Without this, the model generates actions that are plausible in general but wrong for your specific setup.
[... 1,771 words]
Teaching a Humanoid Robot to Wave: Custom Motions with GEAR-SONIC
GEAR-SONIC can track arbitrary motions — not just its built-in locomotion styles. You define joint angles in a CSV, preview them with direct replay in MuJoCo, then deploy through the SONIC neural network. Here’s how the three-stage pipeline works.
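The CSV stage of that pipeline can be sketched as below. The column layout, joint names, and frame rate here are illustrative assumptions, not GEAR-SONIC’s actual schema — the idea is simply that a motion is a time-indexed table of joint angles.

```python
import csv
import math

# Hypothetical keyframe schema: a time column plus one column per joint
# angle in radians. GEAR-SONIC's real CSV layout may differ.
FIELDS = ["t", "shoulder_pitch", "shoulder_roll", "elbow"]

def wave_keyframes(duration=2.0, hz=50, swings=3):
    """Generate a simple waving motion: hold the arm up, oscillate the elbow."""
    frames = []
    for i in range(int(duration * hz)):
        t = i / hz
        frames.append({
            "t": round(t, 3),
            "shoulder_pitch": -1.2,  # arm raised
            "shoulder_roll": 0.3,
            # sinusoidal elbow swing, `swings` full cycles over the clip
            "elbow": 0.5 * math.sin(2 * math.pi * swings * t / duration),
        })
    return frames

with open("wave.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(wave_keyframes())
```

A file like this is what you would then replay directly in MuJoCo to sanity-check the motion before handing it to the tracking policy.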
[... 674 words]
VLA → WBC → MuJoCo: Two Ways to Wire Up NVIDIA’s GR00T Humanoid Stack
There are two ways to wire up NVIDIA’s GR00T stack from vision-language all the way down to physics simulation: the official NVIDIA eval pipeline and a custom pipeline using the SONIC C++ binary. I’ve set up both. Here’s how they work and where they differ.
[... 976 words]
From Vision to Torques: How NVIDIA’s GR00T Stack Controls a Humanoid Robot
NVIDIA’s GR00T stack for humanoid robots has three layers: a Vision-Language-Action model that understands what to do, a whole-body controller that figures out how to move, and a physics simulator that validates it all before touching real hardware. Here’s how they connect.
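The three-layer shape of that stack can be sketched as a toy control loop. Every class name, interface, gain, and rate below is illustrative — the real stack runs the VLA, the whole-body controller, and the simulator as separate processes at very different frequencies — but the slow-outer-loop / fast-inner-loop structure is the point.

```python
class VLA:
    """High level: image + instruction -> a task-space goal. Runs slowly."""
    def plan(self, image, instruction):
        return {"target": [0.4, 0.0, 0.3]}  # e.g. an end-effector position

class WholeBodyController:
    """Mid level: goal + current state -> actuation commands. Runs fast."""
    def step(self, goal, state):
        # Placeholder proportional controller toward the goal.
        return [0.5 * (g - s) for g, s in zip(goal["target"], state)]

class Sim:
    """Low level: physics. Integrates commands into a new state."""
    def __init__(self):
        self.state = [0.0, 0.0, 0.0]
    def apply(self, commands, dt=0.02):
        self.state = [s + c * dt for s, c in zip(self.state, commands)]
        return self.state

vla, wbc, sim = VLA(), WholeBodyController(), Sim()
goal = vla.plan(image=None, instruction="reach forward")  # slow outer loop
for _ in range(500):                                      # fast inner loop
    sim.apply(wbc.step(goal, sim.state))
print([round(s, 3) for s in sim.state])  # state has converged near the goal
```

The design choice the sketch highlights: the VLA never touches joints, and the simulator never sees language — the controller is the only component that speaks both dialects.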
[... 325 words]
GEAR-SONIC
GEAR-SONIC (Supersizing Motion Tracking for Natural Humanoid Whole-Body Control) is the big upgrade over the Decoupled WBC approach in the GR00T stack. It’s a completely different approach to humanoid control — unified whole-body, trained on human motion data rather than hand-crafted reward functions.
[... 759 words]
NVIDIA’s GR00T Whole-Body Control stack in MuJoCo
I’ve been running NVIDIA’s GR00T Whole-Body Control stack in MuJoCo — the sim-to-real bridge for humanoid robot locomotion. A MuJoCo viewer showing a simulated robot walking might look like a toy, but the neural network policy inside it is the same binary that runs on a real Unitree G1. Here’s what’s actually going on.
[... 762 words]
GR00T Architecture: A Systems Engineering Breakdown
GR00T is not just a VLM. It is a Perception → Reasoning → Control generator stack.
[... 1,200 words]
GR00T N1.6 Fine-Tuning — Full Internal Deep Dive
GR00T N1.6 is NVIDIA’s Vision-Language-Action (VLA) model for humanoid robot control. After spending time digging through the internals, here’s a comprehensive deep dive into exactly how fine-tuning works — from model architecture to gradient flow to the data pipeline.
[... 638 words]
PPO vs VLM
Modern humanoid robots combine two fundamentally different kinds of intelligence:
[... 377 words]
How GR00T Merges Vision, Chat, and Action
The biggest challenge is that vision models speak “Image-ish” (pixels) while chat models speak “Text-ish” (tokens). GR00T uses a specialized component called a Projector to act as a real-time translator.
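A projector of that kind is, at its core, a learned map from the vision encoder’s embedding space into the language model’s token-embedding space. The sketch below is a toy single-layer version with made-up dimensions (real projectors are typically small trained MLPs, and SigLIP-scale embeddings are ~1,000-dimensional, not 4):

```python
import random

random.seed(0)

# Toy projector: one linear layer mapping a vision-patch embedding
# (VISION_DIM) into the language model's embedding space (TEXT_DIM).
# Both dimensions and the random weights are purely illustrative.
VISION_DIM, TEXT_DIM = 4, 6
W = [[random.gauss(0, 0.1) for _ in range(VISION_DIM)] for _ in range(TEXT_DIM)]

def project(patch_embedding):
    """Map one image-patch embedding into the LM's embedding space."""
    return [sum(w * x for w, x in zip(row, patch_embedding)) for row in W]

# One "image" arrives as a sequence of patch embeddings; after projection
# the LM can attend over them exactly like ordinary token embeddings.
patches = [[1.0, 0.0, -1.0, 0.5], [0.2, 0.3, 0.1, -0.4]]
soft_tokens = [project(p) for p in patches]
print(len(soft_tokens), len(soft_tokens[0]))
```

The output is a sequence of “soft tokens” the same width as text embeddings — which is exactly what makes pixels and words interleavable in one transformer.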
[... 362 words]
GR00T N1.6 Architecture and Parameter Distribution
GR00T uses a massive “backbone” to understand its surroundings. It combines SigLIP 2 (for vision) and Qwen 3 (for language). While the eyes are frozen to keep perception stable, the reasoning layers are partially trainable to help the robot learn specific tasks.
[... 362 words]