How a VLA Controls a Robot Arm: GR00T N1.5 System Architecture from Camera to Motor
1st March 2026
I’ve been building a robot arm system that uses NVIDIA’s GR00T N1.5 — a Vision-Language-Action (VLA) model — to pick up objects from a table using only a camera, natural language instructions, and 50 demonstration episodes. After getting it working end-to-end, I wanted to write down the full system architecture for anyone trying to understand how all the pieces connect.
This post covers the complete data flow: from a camera frame entering the system to servo motors moving the arm, and everything in between.
System Overview
The system has three components talking in a tight loop:
Camera (world view)
│
▼
┌─────────────┐ language: "Pick up ┌──────────────────┐
│ VLA Model │ ◄── the cucumber..." ◄──── │ Your Controller │
│ (GR00T N1.5│ │ (Python script) │
│ checkpoint)│ │ │
└─────────────┘ └──────────────────┘
▼ ▲
┌─────────────┐ state = [joint angles] │
│ SO-101 Arm │ ──────────────────────────────────────┘
│ 6 Motors │
└─────────────┘
The camera provides visual context. The VLA policy server predicts what the arm should do next. The robot arm executes those predictions and reports its current state back. Repeat at 20Hz.
The VLA Policy Server (GR00T N1.5)
The policy server runs as a separate process, waiting for inference requests. Under the hood it’s a transformer trained on paired video + robot action data.
Every inference cycle, it receives a single observation:
{
    "video.webcam": np.array,          # (1, 3, 480, 640) — one RGB frame
    "state.single_arm": np.array,      # (5,) — [pan, lift, elbow, wrist_flex, wrist_roll]
    "state.gripper": np.array,         # (1,) — [gripper_open_amount]
    "annotation.human.task_description": ["Pick up the cucumber and place it in the bowl"],
}
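As a concrete sketch, that observation can be assembled with NumPy. The key names follow the schema above; the frame and angle values here are dummies standing in for real sensor data:

```python
import numpy as np

# Dummy stand-ins for a real camera frame and encoder readings.
frame = np.zeros((1, 3, 480, 640), dtype=np.uint8)    # one RGB frame, channels-first
arm_angles = np.array([0.0, -30.0, 60.0, 15.0, 0.0])  # degrees: pan, lift, elbow, wrist_flex, wrist_roll
gripper = np.array([45.0])                            # gripper open amount

observation = {
    "video.webcam": frame,
    "state.single_arm": arm_angles,
    "state.gripper": gripper,
    "annotation.human.task_description": ["Pick up the cucumber and place it in the bowl"],
}
```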
And returns 8 future poses — a short motion plan:
[
{"action.single_arm": [p, l, e, wf, wr], "action.gripper": [g]}, # frame 1
{"action.single_arm": [p, l, e, wf, wr], "action.gripper": [g]}, # frame 2
...
{"action.single_arm": [p, l, e, wf, wr], "action.gripper": [g]} # frame 8
]
Each value is a joint angle in degrees. The controller executes these 8 poses sequentially, then requests a fresh prediction with the updated camera frame and joint state.
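The returned chunk flattens naturally into an (8, 6) array of joint targets, one row per future pose. A minimal sketch, assuming the list-of-dicts format shown above:

```python
import numpy as np

def chunk_to_targets(chunk):
    """Stack the 8 predicted poses into an (8, 6) array of joint angles in degrees.

    Each pose is {"action.single_arm": [5 floats], "action.gripper": [1 float]}.
    """
    rows = [pose["action.single_arm"] + pose["action.gripper"] for pose in chunk]
    return np.asarray(rows, dtype=np.float64)

# Example with a dummy all-zeros chunk:
dummy_chunk = [{"action.single_arm": [0.0] * 5, "action.gripper": [0.0]}
               for _ in range(8)]
targets = chunk_to_targets(dummy_chunk)  # shape (8, 6)
```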
The Robot Arm: SO-101
The SO-101 is a 6-DOF arm built from Feetech STS3215 serial bus servos. Each motor has a current position (where it is) and a target position (where you want it to go):
| Joint | Name | Function |
|---|---|---|
| 0 | shoulder_pan | Rotates the base left/right |
| 1 | shoulder_lift | Raises/lowers the upper arm |
| 2 | elbow_flex | Bends the elbow |
| 3 | wrist_flex | Bends the wrist up/down |
| 4 | wrist_roll | Rotates the wrist |
| 5 | gripper | Opens/closes the hand |
The motors communicate over USB serial (/dev/ttyACM1) using the Feetech serial bus protocol. LeRobot handles the low-level motor communication — you interact with a clean Python API.
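The exact LeRobot calls vary by version, so the controller in this post hides them behind two methods. Here's a stand-in with the same interface, useful for testing the control loop without hardware (the class and its backing list are hypothetical; only the joint names and method names come from the setup above):

```python
from dataclasses import dataclass, field

# Joint order matches the table above: 5 arm joints, then the gripper.
JOINT_NAMES = [
    "shoulder_pan", "shoulder_lift", "elbow_flex",
    "wrist_flex", "wrist_roll", "gripper",
]

@dataclass
class FakeArm:
    """Stand-in for the real SO-101 wrapper: same two methods the control
    loop uses, backed by a plain list instead of serial-bus servos."""
    angles: list = field(default_factory=lambda: [0.0] * 6)

    def get_observation(self):
        # Real version reads the motor encoders over USB serial.
        return dict(zip(JOINT_NAMES, self.angles))

    def set_target_state(self, targets):
        # Real version writes goal positions to the servos; here, teleport.
        self.angles = list(targets)
```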
The Control Loop: 6 Steps at 20Hz
Each predicted pose executes on a ~50 millisecond tick (20Hz); a full cycle (read state, run inference, execute eight poses) takes roughly half a second. Here’s exactly what happens:
Step 1 — Read State
robot.get_observation()
→ motors report current angles → state = [6 floats in degrees]
→ camera snaps a frame → image = (480 × 640 × 3) RGB
Step 2 — Remap Keys
"front" key → "webcam" key (matching the training data format)
A small but critical detail: the camera key in your observation dictionary must match the key the model was trained on. This is just a dictionary key rename.
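The remap itself is one line of dictionary surgery. A sketch, assuming the live observation arrives under a `video.front` key (the exact key names depend on your camera config):

```python
def remap_camera_key(obs, src="video.front", dst="video.webcam"):
    """Rename the camera key so the live observation matches the training schema."""
    obs = dict(obs)          # shallow copy; don't mutate the caller's dict
    obs[dst] = obs.pop(src)  # same frame, new key
    return obs
```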
Step 3 — Send to VLA
HTTP/ZMQ request to policy server at localhost:5555
Payload: image + joint state + language goal
Step 4 — Inference
GR00T transformer processes the observation
Predicts a sequence of 8 future joint-angle targets
(~100ms inference time on RTX 4090)
Step 5 — Execute Actions
for each of 8 predicted poses:
robot.set_target_state(angles)
sleep(50ms) ← maintains 20Hz execution rate
Motors move toward target angles
The servo's onboard PID controller handles smooth interpolation
Step 6 — Repeat
Return to Step 1 with the new camera frame and updated joint state. The loop continues until the task is complete.
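Put together, the six steps collapse into a short loop. This is a minimal sketch, not the real controller: `get_policy_actions` stands in for the HTTP/ZMQ round trip to the server on `localhost:5555`, `robot` for the LeRobot-backed arm wrapper, and the `video.front` key for whatever your camera config produces:

```python
import time

CONTROL_HZ = 20  # one pose every 50 ms
TASK = "Pick up the cucumber and place it in the bowl"

def control_loop(robot, get_policy_actions, cycles=3):
    """Run `cycles` full inference cycles: read state, query the VLA, execute 8 poses."""
    for _ in range(cycles):
        obs = robot.get_observation()                 # Step 1: encoders + camera frame
        obs["video.webcam"] = obs.pop("video.front")  # Step 2: key remap
        chunk = get_policy_actions(obs, TASK)         # Steps 3-4: server round trip
        for pose in chunk:                            # Step 5: execute the 8-pose plan
            robot.set_target_state(pose)
            time.sleep(1.0 / CONTROL_HZ)              # hold the 20 Hz execution tick
        # Step 6: fall through and repeat with fresh state
```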
What the VLA Actually Does
The model learned from 50 teleoperated demonstrations. At each inference step, it’s effectively answering one question:
Given this image of the scene, the language goal, and where my arm currently is — what should the arm angles be in the next 8 frames?
There’s no explicit reasoning, no path planning algorithms, no inverse kinematics solver. The VLA is a pattern matcher — it was trained on (image, language, state) → action sequence pairs, and at inference time it reproduces motions that match the current scene.
This is both its power and its limitation. If the scene looks similar to training data, it works remarkably well. If you change the table, lighting, or object position too much, performance degrades. More diverse training episodes help, but there’s always a generalization boundary.
A Note on Sensing: No IMU Needed
The SO-101 has no IMU (inertial measurement unit). The entire robot state comes from motor encoders — joint angle readings with ~0.1° resolution.
IMUs matter for legged robots and humanoids that need to maintain balance. A tabletop arm bolted to a desk doesn’t need to know its orientation relative to gravity. Encoder-only state works fine here.
End-to-End Summary
Camera frame
+ ┌─────────────┐
Current joint angles ──►│ GR00T N1.5 │──► 8 future joint angle targets
+ └─────────────┘
"Pick up cucumber..."
│
execute one by one
│
▼
Motors move to target angles
│
▼
Read new state + new camera frame
│
└──► repeat
The VLA is essentially learned inverse kinematics fused with a task planner. It converts vision + language into motor commands, end-to-end. No hand-coded motion primitives, no coordinate frame transforms, no trajectory optimization. Just a neural network that watched you do the task 50 times and learned to imitate — and that’s enough to pick up a cucumber.