How a VLA Controls a Robot Arm: GR00T N1.5 System Architecture from Camera to Motor
1st March 2026
I’ve been building a robot arm system that uses NVIDIA’s GR00T N1.5 — a Vision-Language-Action (VLA) model — to pick up objects from a table using only a camera, natural language instructions, and 50 demonstration episodes. After getting it working end-to-end, I wanted to write down the full system architecture for anyone trying to understand how all the pieces connect.
This post covers the complete data flow: from a camera frame entering the system to servo motors moving the arm, and everything in between.
System Overview
The system has three components talking in a tight loop:
Camera (world view)
│
▼
┌─────────────┐ language: "Pick up ┌──────────────────┐
│ VLA Model │ ◄── the cucumber..." ◄──── │ Your Controller │
│ (GR00T N1.5│ │ (Python script) │
│ checkpoint)│ │ │
└─────────────┘ └──────────────────┘
▼ ▲
┌─────────────┐ state = [joint angles] │
│ SO-101 Arm │ ──────────────────────────────────────┘
│ 6 Motors │
└─────────────┘
The camera provides visual context. The VLA policy server predicts what the arm should do next. The robot arm executes those predictions and reports its current state back. Repeat at 20Hz.
The VLA Policy Server (GR00T N1.5)
The policy server runs as a separate process, waiting for inference requests. Under the hood it’s a transformer trained on paired video + robot action data.
Every inference cycle, it receives a single observation:
{
    "video.webcam": np.array,          # (1, 3, 480, 640) — one RGB frame
    "state.single_arm": np.array,      # (5,) — [pan, lift, elbow, wrist_flex, wrist_roll]
    "state.gripper": np.array,         # (1,) — [gripper_open_amount]
    "annotation.human.task_description": ["Pick up the cucumber and place it in the bowl"],
}
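As a concrete sketch, that observation can be assembled with NumPy. The key names follow the schema above; the frame and angle values here are dummies standing in for real sensor data:

```python
import numpy as np

# Dummy stand-ins for a real camera frame and encoder readings.
frame = np.zeros((1, 3, 480, 640), dtype=np.uint8)    # one RGB frame, channels-first
arm_angles = np.array([0.0, -30.0, 60.0, 15.0, 0.0])  # degrees: pan, lift, elbow, wrist_flex, wrist_roll
gripper = np.array([45.0])                            # gripper open amount

observation = {
    "video.webcam": frame,
    "state.single_arm": arm_angles,
    "state.gripper": gripper,
    "annotation.human.task_description": ["Pick up the cucumber and place it in the bowl"],
}
```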
And returns 8 future poses — a short motion plan:
[
{"action.single_arm": [p, l, e, wf, wr], "action.gripper": [g]}, # frame 1
{"action.single_arm": [p, l, e, wf, wr], "action.gripper": [g]}, # frame 2
...
{"action.single_arm": [p, l, e, wf, wr], "action.gripper": [g]} # frame 8
]
Each value is a joint angle in degrees. The controller executes these 8 poses sequentially, then requests a fresh prediction with the updated camera frame and joint state.
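The returned chunk flattens naturally into an (8, 6) array of joint targets, one row per future pose. A minimal sketch, assuming the list-of-dicts format shown above:

```python
import numpy as np

def chunk_to_targets(chunk):
    """Stack the 8 predicted poses into an (8, 6) array of joint angles in degrees.

    Each pose is {"action.single_arm": [5 floats], "action.gripper": [1 float]}.
    """
    rows = [pose["action.single_arm"] + pose["action.gripper"] for pose in chunk]
    return np.asarray(rows, dtype=np.float64)

# Example with a dummy all-zeros chunk:
dummy_chunk = [{"action.single_arm": [0.0] * 5, "action.gripper": [0.0]}
               for _ in range(8)]
targets = chunk_to_targets(dummy_chunk)  # shape (8, 6)
```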
The Robot Arm: SO-101
The SO-101 is a 6-DOF arm built from Feetech STS3215 serial bus servos. Each motor has a current position (where it is) and a target position (where you want it to go):
| Joint | Name | Function |
|---|---|---|
| 0 | shoulder_pan | Rotates the base left/right |
| 1 | shoulder_lift | Raises/lowers the upper arm |
| 2 | elbow_flex | Bends the elbow |
| 3 | wrist_flex | Bends the wrist up/down |
| 4 | wrist_roll | Rotates the wrist |
| 5 | gripper | Opens/closes the hand |
The motors communicate over USB serial (/dev/ttyACM1) using the Feetech serial bus protocol. LeRobot handles the low-level motor communication — you interact with a clean Python API.
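The exact LeRobot calls vary by version, so the controller in this post hides them behind two methods. Here's a stand-in with the same interface, useful for testing the control loop without hardware (the class and its backing list are hypothetical; only the joint names and method names come from the setup above):

```python
from dataclasses import dataclass, field

# Joint order matches the table above: 5 arm joints, then the gripper.
JOINT_NAMES = [
    "shoulder_pan", "shoulder_lift", "elbow_flex",
    "wrist_flex", "wrist_roll", "gripper",
]

@dataclass
class FakeArm:
    """Stand-in for the real SO-101 wrapper: same two methods the control
    loop uses, backed by a plain list instead of serial-bus servos."""
    angles: list = field(default_factory=lambda: [0.0] * 6)

    def get_observation(self):
        # Real version reads the motor encoders over USB serial.
        return dict(zip(JOINT_NAMES, self.angles))

    def set_target_state(self, targets):
        # Real version writes goal positions to the servos; here, teleport.
        self.angles = list(targets)
```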
The Control Loop: 6 Steps at 20Hz
Each predicted pose executes on a ~50 millisecond tick (20Hz); a full cycle (read state, run inference, execute eight poses) takes roughly half a second. Here’s exactly what happens:
Step 1 — Read State
robot.get_observation()
→ motors report current angles → state = [6 floats in degrees]
→ camera snaps a frame → image = (480 × 640 × 3) RGB
Step 2 — Remap Keys
"front" key → "webcam" key (matching the training data format)
A small but critical detail: the camera key in your observation dictionary must match the key the model was trained on. This is just a dictionary key rename.
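The remap itself is one line of dictionary surgery. A sketch, assuming the live observation arrives under a `video.front` key (the exact key names depend on your camera config):

```python
def remap_camera_key(obs, src="video.front", dst="video.webcam"):
    """Rename the camera key so the live observation matches the training schema."""
    obs = dict(obs)          # shallow copy; don't mutate the caller's dict
    obs[dst] = obs.pop(src)  # same frame, new key
    return obs
```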
Step 3 — Send to VLA
HTTP/ZMQ request to policy server at localhost:5555
Payload: image + joint state + language goal
Step 4 — Inference
GR00T transformer processes the observation
Predicts a sequence of 8 future joint-angle targets
(~100ms inference time on RTX 4090)
Step 5 — Execute Actions
for each of 8 predicted poses:
robot.set_target_state(angles)
sleep(50ms) ← maintains 20Hz execution rate
Motors move toward target angles
The servo's onboard PID controller handles smooth interpolation
Step 6 — Repeat
Return to Step 1 with the new camera frame and updated joint state. The loop continues until the task is complete.
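Put together, the six steps collapse into a short loop. This is a minimal sketch, not the real controller: `get_policy_actions` stands in for the HTTP/ZMQ round trip to the server on `localhost:5555`, `robot` for the LeRobot-backed arm wrapper, and the `video.front` key for whatever your camera config produces:

```python
import time

CONTROL_HZ = 20  # one pose every 50 ms
TASK = "Pick up the cucumber and place it in the bowl"

def control_loop(robot, get_policy_actions, cycles=3):
    """Run `cycles` full inference cycles: read state, query the VLA, execute 8 poses."""
    for _ in range(cycles):
        obs = robot.get_observation()                 # Step 1: encoders + camera frame
        obs["video.webcam"] = obs.pop("video.front")  # Step 2: key remap
        chunk = get_policy_actions(obs, TASK)         # Steps 3-4: server round trip
        for pose in chunk:                            # Step 5: execute the 8-pose plan
            robot.set_target_state(pose)
            time.sleep(1.0 / CONTROL_HZ)              # hold the 20 Hz execution tick
        # Step 6: fall through and repeat with fresh state
```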
What the VLA Actually Does
The model learned from 50 teleoperated demonstrations. At each inference step, it’s effectively answering one question:
Given this image of the scene, the language goal, and where my arm currently is — what should the arm angles be in the next 8 frames?
There’s no explicit reasoning, no path planning algorithms, no inverse kinematics solver. The VLA is a pattern matcher — it was trained on (image, language, state) → action sequence pairs, and at inference time it reproduces motions that match the current scene.
This is both its power and its limitation. If the scene looks similar to training data, it works remarkably well. If you change the table, lighting, or object position too much, performance degrades. More diverse training episodes help, but there’s always a generalization boundary.
A Note on Sensing: No IMU Needed
The SO-101 has no IMU (inertial measurement unit). The entire robot state comes from motor encoders — joint angle readings with ~0.1° resolution.
IMUs matter for legged robots and humanoids that need to maintain balance. A tabletop arm bolted to a desk doesn’t need to know its orientation relative to gravity. Encoder-only state works fine here.
End-to-End Summary
Camera frame
+ ┌─────────────┐
Current joint angles ──►│ GR00T N1.5 │──► 8 future joint angle targets
+ └─────────────┘
"Pick up cucumber..."
│
execute one by one
│
▼
Motors move to target angles
│
▼
Read new state + new camera frame
│
└──► repeat
The VLA is essentially learned inverse kinematics fused with a task planner. It converts vision + language into motor commands, end-to-end. No hand-coded motion primitives, no coordinate frame transforms, no trajectory optimization. Just a neural network that watched you do the task 50 times and learned to imitate — and that’s enough to pick up a cucumber.