Akshay Parkhi's Weblog

Collecting Training Data for VLA Robot Fine-Tuning (The Hard Way)

28th February 2026

A Vision-Language-Action model takes camera images and a language instruction as input, and outputs robot joint actions. NVIDIA’s GR00T N1.5 is one such model — pre-trained on millions of robot demonstrations and fine-tunable for your specific robot and task. The catch: even though GR00T is pre-trained, you still need your own demonstrations to teach it your robot’s exact joint calibration, camera angles, and task environment. Without this, the model generates actions that are plausible in general but wrong for your specific setup.
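To make that input/output contract concrete, here is a toy sketch of the interface — the class name, observation keys, and shapes are illustrative placeholders, not GR00T's actual API:

```python
# Toy sketch of a VLA policy's contract: observation dict in,
# a chunk of joint-space actions out. All names here are illustrative.

class ToyVLAPolicy:
    def get_action(self, observation: dict) -> list[list[float]]:
        # A real VLA runs the image and instruction through a
        # transformer; this stub just returns 16 zeroed 6-joint targets.
        return [[0.0] * 6 for _ in range(16)]

obs = {
    "image": None,        # camera frame, e.g. (480, 640, 3) uint8
    "state": [0.0] * 6,   # current joint positions
    "task": "Pick up the banana and put it on the plate.",
}
chunk = ToyVLAPolicy().get_action(obs)
print(len(chunk), len(chunk[0]))  # 16 actions x 6 joints
```

Fine-tuning adjusts the mapping inside `get_action` so the returned targets match your robot's calibration and camera viewpoint.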

Here’s everything I learned building a pick-and-place dataset for GR00T on an SO-101 arm — including every painful mistake along the way.

Hardware Setup

  • Robot: SO-101 follower arm (Feetech STS3215 motors)
  • Front camera: Logitech C270 USB webcam
  • Wrist camera: Generic USB “Web Camera”
  • Computer: Windows 11 + WSL2 (Ubuntu 22.04)
  • GPU: NVIDIA RTX (for policy server)

What Is the LeRobot Data Format?

LeRobot is Hugging Face’s robot learning framework. Its data format is the standard for recording, sharing, and training on robot demonstrations. A LeRobot dataset looks like this:

banana-pick/
├── meta/
│   ├── info.json          ← dataset metadata (fps, motor count, camera config)
│   ├── episodes.jsonl     ← per-episode info (length, task description)
│   └── modality.json      ← GR00T-specific: maps columns to model inputs
├── data/
│   └── train-00000-of-00001.parquet  ← joint states + actions (tabular)
└── videos/
    ├── front_rgb_episode_000000.mp4  ← front camera video
    └── wrist_rgb_episode_000000.mp4  ← wrist camera video

The parquet files contain timestamped rows with joint positions (state), joint targets (action), and episode indices. The videos are synchronized MP4 files at the same FPS. This format is what GR00T’s fine-tuning script, LeRobot’s training pipeline, and Hugging Face Hub all expect. When you run lerobot.record, it writes directly into this format.
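To make the tabular part concrete, here is a toy sketch of the per-frame rows that end up in the parquet file. Column names follow LeRobot's conventions, but treat the exact schema as illustrative — it varies by dataset version:

```python
# Toy rows mirroring the parquet layout: one row per camera frame,
# 300 frames for a 10-second episode at 30 fps.
fps = 30
rows = []
for frame_idx in range(300):
    rows.append({
        "timestamp": frame_idx / fps,    # seconds since episode start
        "episode_index": 0,
        "frame_index": frame_idx,
        "observation.state": [0.0] * 6,  # 6 measured joint positions
        "action": [0.0] * 6,             # 6 commanded joint targets
    })

print(len(rows), rows[1]["timestamp"])
```

The video files carry the matching camera frames; the shared FPS and timestamps are what keep images and joint data aligned at training time.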

Step 1: Get the Robot and Cameras Working

Problem: Cameras timed out in WSL2

RuntimeError: OpenCVCamera(2) read failed (status=False)
[ WARN] global cap_v4l.cpp:1049 tryIoctl VIDEOIO(V4L2:/dev/video2): select() timeout.

Root cause: WSL2’s usbipd USB passthrough uses the uncompressed YUYV pixel format by default. YUYV packs 2 bytes per pixel, so 640x480 @ 30fps is ~18 MB/s per camera — too much for the USB-over-IP tunnel.

Fix: Force MJPEG codec right after VideoCapture opens. Edit the LeRobot camera driver:

# /path/to/lerobot/cameras/opencv/camera_opencv.py
self.videocapture = cv2.VideoCapture(self.index_or_path, self.backend)

# Add this line:
self.videocapture.set(cv2.CAP_PROP_FOURCC,
    cv2.VideoWriter_fourcc('M', 'J', 'P', 'G'))

MJPEG compresses frames before USB transfer, reducing bandwidth ~10x.
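The raw-bandwidth figure is easy to sanity-check (assuming YUYV's 2 bytes per pixel):

```python
def yuyv_bandwidth_mb_s(width: int, height: int, fps: int) -> float:
    """Uncompressed YUYV bandwidth: 2 bytes per pixel, per frame."""
    return width * height * 2 * fps / 1e6

print(yuyv_bandwidth_mb_s(640, 480, 30))  # ~18.4 MB/s per camera
```

Two cameras at that rate saturate the USB-over-IP tunnel, which is why MJPEG's ~10x compression matters here.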

Problem: Camera indices change after every usbipd re-attach

After detaching and re-attaching cameras in PowerShell, /dev/video0 might become /dev/video4. Always check:

for i in 0 1 2 3 4 5; do
  name=$(cat /sys/class/video4linux/video$i/name 2>/dev/null)
  echo "video$i: $name"
done

Or use LeRobot’s built-in tool:

python -m lerobot.find_cameras opencv
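The sysfs check can also be scripted if you want to resolve indices programmatically before recording. A small sketch (Linux-only; it simply returns an empty list elsewhere):

```python
from pathlib import Path

def list_video_devices() -> list[tuple[str, str]]:
    """Read camera names from sysfs, like the shell loop above."""
    devices = []
    for node in sorted(Path("/sys/class/video4linux").glob("video*")):
        name = (node / "name").read_text().strip()
        devices.append((f"/dev/{node.name}", name))
    return devices

print(list_video_devices())
```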

Problem: Two cameras simultaneously = corruption

Running two cameras at the same time over WSL2 usbipd causes partial JPEG frames — you see the top strip of a real image then solid blue blocks below it.

Corrupt JPEG data: 196 extraneous bytes before marker 0xdd

Fix: Use only one camera for data collection (the front camera). Single-camera GR00T models work well for pick-and-place. Use --data-config so100 instead of so100_dualcam when fine-tuning.

Permanent fix: Run Linux natively (dual boot) — no USB passthrough bottleneck.

Step 2: Fix the Policy Server Serialization

When the GR00T policy server (running in the gr00t conda env) and the Dum-E controller (running in the dum-e conda env) use different PyTorch versions, you get:

_pickle.UnpicklingError: unpickling stack underflow

Root cause: The controller used torch.save/load (TorchSerializer) but the Isaac-GR00T server uses msgpack (MsgSerializer). They speak different protocols over ZMQ.

Fix: Update the Dum-E client policy/gr00t/service.py to use msgpack:

import io, msgpack, numpy as np, zmq

class MsgSerializer:
    @staticmethod
    def to_bytes(data: dict) -> bytes:
        return msgpack.packb(data,
            default=MsgSerializer.encode_custom_classes)

    @staticmethod
    def from_bytes(data: bytes) -> dict:
        return msgpack.unpackb(data,
            object_hook=MsgSerializer.decode_custom_classes)

    @staticmethod
    def decode_custom_classes(obj):
        if "__ndarray_class__" in obj:
            obj = np.load(io.BytesIO(obj["as_npy"]),
                allow_pickle=False)
        return obj

    @staticmethod
    def encode_custom_classes(obj):
        if isinstance(obj, np.ndarray):
            output = io.BytesIO()
            np.save(output, obj, allow_pickle=False)
            return {"__ndarray_class__": True,
                    "as_npy": output.getvalue()}
        return obj

Then install msgpack in the client env: pip install msgpack
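Before wiring the serializer into ZMQ, it's worth sanity-checking the ndarray round trip in isolation. This standalone sketch uses the same encode/decode hooks as the class above:

```python
import io

import msgpack
import numpy as np

def encode(obj):
    # Pack ndarrays via np.save so dtype and shape survive the wire.
    if isinstance(obj, np.ndarray):
        buf = io.BytesIO()
        np.save(buf, obj, allow_pickle=False)
        return {"__ndarray_class__": True, "as_npy": buf.getvalue()}
    return obj

def decode(obj):
    # Mirror of encode: rebuild ndarrays from their npy bytes.
    if "__ndarray_class__" in obj:
        return np.load(io.BytesIO(obj["as_npy"]), allow_pickle=False)
    return obj

msg = {"action": np.zeros((16, 6), dtype=np.float32)}
wire = msgpack.packb(msg, default=encode)
back = msgpack.unpackb(wire, object_hook=decode)
print(back["action"].shape, back["action"].dtype)
```

If shape and dtype come back intact here but the live connection still fails, the mismatch is on the other side of the socket.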

Debug tip: If you’re not sure what the server is returning, inspect the raw ZMQ bytes:

import zmq
ctx = zmq.Context()
sock = ctx.socket(zmq.REQ)
sock.connect('tcp://localhost:5555')
sock.send(b'test')
msg = sock.recv()
print(f'First 50 bytes: {msg[:50]}')

Step 3: Data Collection Without a Leader Arm

The standard approach for SO-101 data collection is teleoperation with a leader arm — a second SO-101 you hold and move while the follower mirrors it. But at ~$100/arm, it’s a significant cost. There are alternatives.

Teleoperation Options

| Option | Cost | Quality | Notes |
|---|---|---|---|
| Leader arm (second SO-101) | ~$100 | Best | What Dum-E was designed for |
| SpaceMouse | ~$100 | Good | 6DOF input, --teleop.type=spacemouse |
| Gamepad / Xbox controller | ~$30 | Good | --teleop.type=gamepad, much smoother than keyboard |
| Keyboard | Free | Poor | Slow, jerky motions — quality matters a lot for VLA |
| Existing dataset | Free | Varies | Download from HuggingFace, skip collection entirely |
| Kinesthetic teaching | Free | Good | Move the arm by hand with motors disabled |

Why Keyboard Teleoperation Does NOT Work

LeRobot’s --teleop.type=keyboard sends raw key names as actions. The so101_follower send_action() expects joint position arrays. They’re incompatible — you’ll get:

StopIteration at motors_bus.py:1179 in sync_write("Goal_Position", ...)

Keyboard teleoperation only works with So100FollowerEndEffector (end-effector Cartesian control mode), not standard joint control.

Kinesthetic Teaching (Free, No Extra Hardware)

Disable motor torque — the arm goes limp and can be moved by hand. Physically guide it through the task while recording joint positions + camera at 30fps. Re-enable torque after each episode.

import time, numpy as np, cv2
from lerobot.cameras.opencv.camera_opencv import OpenCVCamera
from lerobot.cameras.opencv.configuration_opencv import OpenCVCameraConfig
from lerobot.robots import make_robot_from_config
from lerobot.robots.so101_follower import SO101FollowerConfig

# Connect robot without cameras
robot_cfg = SO101FollowerConfig(
    id="my_robot", port="/dev/ttyACM0", cameras={})
robot = make_robot_from_config(robot_cfg)
robot.connect(calibrate=False)

# Connect camera separately
cam = OpenCVCamera(OpenCVCameraConfig(
    index_or_path=1, fps=30, width=640, height=480))
cam.connect()

state_keys = ["shoulder_pan.pos", "shoulder_lift.pos",
              "elbow_flex.pos", "wrist_flex.pos",
              "wrist_roll.pos", "gripper.pos"]

# Record one episode — arm goes limp, you move it by hand
frames = []
with robot.bus.torque_disabled():
    start = time.time()
    while time.time() - start < 10:  # 10 second episode
        obs = robot.get_observation()
        state = np.array([obs[k] for k in state_keys])
        img = cam.read()
        frames.append({"state": state, "image": img})
        time.sleep(1/30)

print(f"Recorded {len(frames)} frames")

Using an Existing Dataset

Skip data collection entirely — download someone else’s fruit-picking dataset and fine-tune on that:

huggingface-cli download \
  --repo-type dataset youliangtan/so100_strawberry_grape \
  --local-dir ./demo_data/so100_strawberry_grape

This dataset is SO-100 with similar pick-and-place tasks. It won’t be perfect for your exact setup but is better than no data. Recommended: start with an existing dataset to validate the pipeline, then collect your own data with a gamepad or second arm.

Step 4: Data Quality Checklist

Bad data is worse than no data. Check every episode before keeping it.

import json, numpy as np, cv2
from pathlib import Path

base = Path('./demo_data/banana-pick')
for ep in sorted(base.glob('episode_*')):
    states = np.load(ep / 'states.npy')
    cap = cv2.VideoCapture(str(ep / 'front.mp4'))
    ret, frame = cap.read()
    cap.release()

    sharpness = cv2.Laplacian(
        cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY), cv2.CV_64F
    ).var() if ret else 0

    motion = np.abs(states[-1] - states[0]).sum()
    brightness = cv2.cvtColor(
        frame, cv2.COLOR_BGR2GRAY).mean() if ret else 0

    ok = motion > 40 and sharpness > 150 and brightness > 30
    status = "OK" if ok else "BAD"
    print(f'{ep.name}: motion={motion:.0f} '
          f'sharpness={sharpness:.0f} '
          f'brightness={brightness:.0f} [{status}]')

Quality Thresholds

| Metric | Minimum | Notes |
|---|---|---|
| Sharpness (Laplacian variance) | > 150 | < 50 = camera covered or blurred |
| Motion (total joint degrees) | > 40 | < 20 = arm barely moved |
| Brightness | > 30 | < 30 = too dark, model can’t see |
| Frames per episode | ~300 (10s @ 30fps) | Fewer = incomplete demo |
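The thresholds above can be bundled into a single check. An illustrative helper, not part of LeRobot:

```python
def episode_ok(motion: float, sharpness: float,
               brightness: float, n_frames: int) -> bool:
    """Apply the per-episode quality thresholds from the table."""
    return (motion > 40           # arm actually moved
            and sharpness > 150   # camera uncovered and in focus
            and brightness > 30   # workspace lit well enough to see
            and n_frames >= 270)  # roughly a full 10 s episode at 30 fps

print(episode_ok(motion=120, sharpness=400, brightness=90, n_frames=300))
print(episode_ok(motion=15, sharpness=400, brightness=90, n_frames=300))
```

The first call passes; the second fails on motion — the arm barely moved.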

Common Reasons for Bad Episodes

| Problem | Symptom | Fix |
|---|---|---|
| Camera moved during recording | Sharpness drops mid-episode | Mount camera with tape/clamp |
| Camera facing wrong direction | Low sharpness, no arm visible | Reposition before every session |
| Arm barely moved | Low motion score | Redo — complete full pick-and-place |
| MJPEG corruption (WSL2) | Blue block artifacts, sharpness = 3 | Use single camera or native Linux |
| Too dark | Low brightness | Add a desk lamp above workspace |

Step 5: Camera Positioning

The most underrated part of data collection. The model learns entirely from what the camera sees — if the camera moves between episodes, or doesn’t show the object and arm clearly, the model learns noise.

  • Fix the camera — tape it, clamp it, don’t touch it between episodes
  • Show the whole task — banana, plate, and robot arm must all be visible
  • Consistent position — same angle every session, every day
  • Good lighting — natural or desk lamp, avoid shadows on the workspace
  • No background clutter — plain table surface is better than a messy desk

Step 6: Recording with LeRobot

If you have a leader arm or gamepad, the standard LeRobot recording command:

conda activate dum-e

python -m lerobot.record \
  --robot.type=so101_follower \
  --robot.port=/dev/ttyACM0 \
  --robot.id=my_awesome_follower_arm \
  --robot.cameras="{ front: {type: opencv, index_or_path: 0,
    width: 640, height: 480, fps: 30}}" \
  --dataset.repo_id=YOUR_HF_USERNAME/banana-pick \
  --dataset.single_task="Pick up the banana and put it on the plate." \
  --dataset.num_episodes=50

Tips for good demonstrations:

  • Aim for 50-100 episodes — more is better
  • Keep the camera position identical every session
  • Vary the banana position slightly across episodes so the model generalizes
  • Slow, smooth motions — jerky demos teach jerky behavior
  • Each episode: start from rest, pick banana, place on plate, stop

Step 7: Fine-Tune GR00T

Once you have 50+ good episodes:

# 1. Copy the modality config
cp Isaac-GR00T/examples/so100__modality.json \
   ./demo_data/banana-pick/meta/modality.json

# 2. Fine-tune (in gr00t conda env)
conda activate gr00t
cd Isaac-GR00T

python scripts/gr00t_finetune.py \
  --dataset-path ./demo_data/banana-pick \
  --num-gpus 1 \
  --batch-size 32 \
  --output-dir ~/banana-checkpoint \
  --max-steps 10000 \
  --data-config so100 \
  --video-backend torchvision_av

# 3. Start policy server with your checkpoint
python scripts/inference_service.py \
  --server \
  --model_path ~/banana-checkpoint \
  --embodiment_tag new_embodiment \
  --data_config so100

# 4. Run the robot (in dum-e conda env)
conda activate dum-e

python -m embodiment.so_arm10x.controller \
  --config my-dum-e.yaml \
  --instruction "Pick up the banana and put it on the plate."

Reduce --batch-size if you run out of VRAM (try 16 or 8). 10k steps takes ~2-4 hours on an RTX 3060.

Lessons Learned

  1. WSL2 + two cameras = trouble — use one camera or boot Linux natively
  2. Camera indices change after every usbipd re-attach — always check before recording
  3. MJPEG fixes WSL2 camera timeouts — set fourcc before the first read
  4. Serialization must match — server and client must use the same format (msgpack vs torch.save)
  5. Keyboard teleoperation doesn’t work with joint-space robots — use kinesthetic teaching or a gamepad
  6. Mount the camera securely — a moving camera ruins every episode
  7. Quality over quantity — 30 perfect demos beat 100 bad ones
  8. Check data before training — automated quality scripts save hours of wasted GPU time
  9. 10 episodes is not enough — aim for 50-100 for pick-and-place tasks
  10. The model needs YOUR data — a fine-tuned model from someone else’s robot won’t work on yours without retraining
