Collecting Training Data for VLA Robot Fine-Tuning (The Hard Way)
28th February 2026
A Vision-Language-Action model takes camera images and a language instruction as input, and outputs robot joint actions. NVIDIA’s GR00T N1.5 is one such model — pre-trained on millions of robot demonstrations and fine-tunable for your specific robot and task. The catch: even though GR00T is pre-trained, you still need your own demonstrations to teach it your robot’s exact joint calibration, camera angles, and task environment. Without this, the model generates actions that are plausible in general but wrong for your specific setup.
Here’s everything I learned building a pick-and-place dataset for GR00T on an SO-101 arm — including every painful mistake along the way.
Hardware Setup
- Robot: SO-101 follower arm (Feetech STS3215 motors)
- Front camera: Logitech C270 USB webcam
- Wrist camera: Generic USB “Web Camera”
- Computer: Windows 11 + WSL2 (Ubuntu 22.04)
- GPU: NVIDIA RTX (for policy server)
What is the LeRobot Data Format?
LeRobot is Hugging Face’s robot learning framework. Its data format is the standard for recording, sharing, and training on robot demonstrations. A LeRobot dataset looks like this:
banana-pick/
├── meta/
│ ├── info.json ← dataset metadata (fps, motor count, camera config)
│ ├── episodes.jsonl ← per-episode info (length, task description)
│ └── modality.json ← GR00T-specific: maps columns to model inputs
├── data/
│ └── train-00000-of-00001.parquet ← joint states + actions (tabular)
└── videos/
├── front_rgb_episode_000000.mp4 ← front camera video
└── wrist_rgb_episode_000000.mp4 ← wrist camera video
The parquet files contain timestamped rows with joint positions (state), joint targets (action), and episode indices. The videos are synchronized MP4 files at the same FPS. This format is what GR00T’s fine-tuning script, LeRobot’s training pipeline, and Hugging Face Hub all expect. When you run lerobot.record, it writes directly into this format.
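To get a feel for those parquet rows, here is a toy frame with the same shape. The column names are illustrative (real LeRobot datasets use names along the lines of `observation.state` and `action`, plus bookkeeping columns); reading an actual file is just `pd.read_parquet("banana-pick/data/train-00000-of-00001.parquet")`.

```python
import numpy as np
import pandas as pd

# Toy rows mimicking the LeRobot tabular schema (illustrative, not exact):
# one row per timestep, with 6-DOF state and action vectors at 30 fps.
df = pd.DataFrame({
    "timestamp": np.arange(5) / 30.0,                      # seconds, 30 fps
    "episode_index": np.zeros(5, dtype=np.int64),          # which episode
    "observation.state": [np.zeros(6).tolist() for _ in range(5)],
    "action": [np.zeros(6).tolist() for _ in range(5)],
})
print(df.shape)  # → (5, 4)
```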
Step 1: Get the Robot and Cameras Working
Problem: Cameras timed out in WSL2
RuntimeError: OpenCVCamera(2) read failed (status=False)
[ WARN] global cap_v4l.cpp:1049 tryIoctl VIDEOIO(V4L2:/dev/video2): select() timeout.
Root cause: WSL2’s usbipd USB passthrough uses YUYV (uncompressed) pixel format by default. YUYV is 2 bytes per pixel, so 640x480 @ 30fps is ~18 MB/s per camera — too much for the USB-over-IP tunnel.
Fix: Force MJPEG codec right after VideoCapture opens. Edit the LeRobot camera driver:
# /path/to/lerobot/cameras/opencv/camera_opencv.py
self.videocapture = cv2.VideoCapture(self.index_or_path, self.backend)
# Add this line: force MJPEG so frames are compressed before USB transfer
self.videocapture.set(cv2.CAP_PROP_FOURCC, cv2.VideoWriter_fourcc(*"MJPG"))
MJPEG compresses frames before USB transfer, reducing bandwidth ~10x.
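A driver can silently ignore the `set()` call, so it’s worth reading back `CAP_PROP_FOURCC` and decoding it. The property comes back as a packed 32-bit integer; here’s a small sketch of the encode/decode math, with the live-camera check left as comments since it needs real hardware:

```python
def fourcc_to_str(code: int) -> str:
    """Decode an OpenCV FOURCC integer into its four-character string."""
    return "".join(chr((code >> (8 * i)) & 0xFF) for i in range(4))

def str_to_fourcc(s: str) -> int:
    """Equivalent of cv2.VideoWriter_fourcc(*s): pack 4 chars little-endian."""
    return sum(ord(c) << (8 * i) for i, c in enumerate(s))

# With a live camera you would verify the format actually took:
#   cap.set(cv2.CAP_PROP_FOURCC, cv2.VideoWriter_fourcc(*"MJPG"))
#   assert fourcc_to_str(int(cap.get(cv2.CAP_PROP_FOURCC))) == "MJPG"
print(fourcc_to_str(str_to_fourcc("MJPG")))  # → MJPG
```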
Problem: Camera indices change after every usbipd re-attach
After detaching and re-attaching cameras in PowerShell, /dev/video0 might become /dev/video4. Always check:
for i in 0 1 2 3 4 5; do
name=$(cat /sys/class/video4linux/video$i/name 2>/dev/null)
echo "video$i: $name"
done
Or use LeRobot’s built-in tool:
python -m lerobot.find_cameras opencv
Problem: Two cameras simultaneously = corruption
Running two cameras at the same time over WSL2 usbipd causes partial JPEG frames — you see the top strip of a real image then solid blue blocks below it.
Corrupt JPEG data: 196 extraneous bytes before marker 0xdd
Fix: Use only one camera for data collection (the front camera). Single-camera GR00T models work well for pick-and-place. Use --data-config so100 instead of so100_dualcam when fine-tuning.
Permanent fix: Run Linux natively (dual boot) — no USB passthrough bottleneck.
Step 2: Fix the Policy Server Serialization
When the GR00T policy server (running in the gr00t conda env) and the Dum-E controller (running in the dum-e conda env) use different PyTorch versions, you get:
_pickle.UnpicklingError: unpickling stack underflow
Root cause: The controller used torch.save/load (TorchSerializer) but the Isaac-GR00T server uses msgpack (MsgSerializer). They speak different protocols over ZMQ.
Fix: Update the Dum-E client policy/gr00t/service.py to use msgpack:
import io, msgpack, numpy as np, zmq

class MsgSerializer:
    @staticmethod
    def to_bytes(data: dict) -> bytes:
        return msgpack.packb(data, default=MsgSerializer.encode_custom_classes)

    @staticmethod
    def from_bytes(data: bytes) -> dict:
        return msgpack.unpackb(data, object_hook=MsgSerializer.decode_custom_classes)

    @staticmethod
    def decode_custom_classes(obj):
        # object_hook runs on every decoded dict; rebuild tagged ndarrays
        if "__ndarray_class__" in obj:
            obj = np.load(io.BytesIO(obj["as_npy"]), allow_pickle=False)
        return obj

    @staticmethod
    def encode_custom_classes(obj):
        # default runs on types msgpack can't encode natively
        if isinstance(obj, np.ndarray):
            output = io.BytesIO()
            np.save(output, obj, allow_pickle=False)
            return {"__ndarray_class__": True, "as_npy": output.getvalue()}
        return obj
Then install msgpack in the client env: pip install msgpack
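The key idea is that arrays cross the wire as raw `.npy` bytes, which either side can decode regardless of its PyTorch version. Here’s the ndarray round-trip in isolation (msgpack layer omitted, so this runs with numpy alone):

```python
import io
import numpy as np

# Round-trip sketch of the ndarray encoding used above: serialize to .npy
# bytes, tag the payload, decode on the other side. No pickle, no torch.
arr = np.arange(6, dtype=np.float32)
buf = io.BytesIO()
np.save(buf, arr, allow_pickle=False)
payload = {"__ndarray_class__": True, "as_npy": buf.getvalue()}

decoded = np.load(io.BytesIO(payload["as_npy"]), allow_pickle=False)
print(np.array_equal(arr, decoded))  # → True
```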
Debug tip: If you’re not sure what the server is returning, inspect the raw ZMQ bytes:
import zmq
ctx = zmq.Context()
sock = ctx.socket(zmq.REQ)
sock.connect('tcp://localhost:5555')
sock.send(b'test')
msg = sock.recv()
print(f'First 50 bytes: {msg[:50]}')
Step 3: Data Collection Without a Leader Arm
The standard approach for SO-101 data collection is teleoperation with a leader arm — a second SO-101 you hold and move while the follower mirrors it. But at ~$100/arm, it’s a significant cost. There are alternatives.
Teleoperation Options
| Option | Cost | Quality | Notes |
|---|---|---|---|
| Leader arm (second SO-101) | ~$100 | Best | What Dum-E was designed for |
| SpaceMouse | ~$100 | Good | 6DOF input, --teleop.type=spacemouse |
| Gamepad / Xbox controller | ~$30 | Good | --teleop.type=gamepad, much smoother than keyboard |
| Keyboard | Free | Poor | Slow, jerky motions — quality matters a lot for VLA |
| Existing dataset | Free | Varies | Download from HuggingFace, skip collection entirely |
| Kinesthetic teaching | Free | Good | Move the arm by hand with motors disabled |
Why Keyboard Teleoperation Does NOT Work
LeRobot’s --teleop.type=keyboard sends raw key names as actions. The so101_follower send_action() expects joint position arrays. They’re incompatible — you’ll get:
StopIteration at motors_bus.py:1179 in sync_write("Goal_Position", ...)
Keyboard teleoperation only works with So100FollowerEndEffector (end-effector Cartesian control mode), not standard joint control.
Kinesthetic Teaching (Free, No Extra Hardware)
Disable motor torque — the arm goes limp and can be moved by hand. Physically guide it through the task while recording joint positions + camera at 30fps. Re-enable torque after each episode.
import time, numpy as np, cv2
from lerobot.cameras.opencv.camera_opencv import OpenCVCamera
from lerobot.cameras.opencv.configuration_opencv import OpenCVCameraConfig
from lerobot.robots import make_robot_from_config
from lerobot.robots.so101_follower import SO101FollowerConfig

# Connect robot without cameras
robot_cfg = SO101FollowerConfig(id="my_robot", port="/dev/ttyACM0", cameras={})
robot = make_robot_from_config(robot_cfg)
robot.connect(calibrate=False)

# Connect camera separately
cam = OpenCVCamera(OpenCVCameraConfig(index_or_path=1, fps=30, width=640, height=480))
cam.connect()

state_keys = ["shoulder_pan.pos", "shoulder_lift.pos", "elbow_flex.pos",
              "wrist_flex.pos", "wrist_roll.pos", "gripper.pos"]

# Record one episode — arm goes limp, you move it by hand
frames = []
with robot.bus.torque_disabled():
    start = time.time()
    while time.time() - start < 10:  # 10 second episode
        obs = robot.get_observation()
        state = np.array([obs[k] for k in state_keys])
        img = cam.read()
        frames.append({"state": state, "image": img})
        time.sleep(1 / 30)
print(f"Recorded {len(frames)} frames")
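Once recorded, the frames need to land on disk before the Step 4 quality script can check them. A minimal sketch, assuming the ad-hoc `episode_*/states.npy` layout that script reads — the video half needs `cv2.VideoWriter` and a real camera image, so it’s left as comments:

```python
import tempfile
from pathlib import Path
import numpy as np

# Persist one kinesthetic episode: stack the per-frame joint states into a
# (T, 6) array and save it. Filenames here mirror the quality script's
# assumptions (states.npy + front.mp4 per episode directory).
def save_episode(frames, out_dir: Path) -> np.ndarray:
    out_dir.mkdir(parents=True, exist_ok=True)
    states = np.stack([f["state"] for f in frames])  # shape (T, 6)
    np.save(out_dir / "states.npy", states)
    # Video side (needs cv2):
    #   vw = cv2.VideoWriter(str(out_dir / "front.mp4"),
    #                        cv2.VideoWriter_fourcc(*"mp4v"), 30, (640, 480))
    #   for f in frames: vw.write(f["image"])
    #   vw.release()
    return states

# Tiny smoke test with fake frames
fake = [{"state": np.zeros(6), "image": None} for _ in range(5)]
with tempfile.TemporaryDirectory() as d:
    states = save_episode(fake, Path(d) / "episode_000000")
print(states.shape)  # → (5, 6)
```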
Using an Existing Dataset
Skip data collection entirely — download someone else’s fruit-picking dataset and fine-tune on that:
huggingface-cli download \
--repo-type dataset youliangtan/so100_strawberry_grape \
--local-dir ./demo_data/so100_strawberry_grape
This dataset is SO-100 with similar pick-and-place tasks. It won’t be perfect for your exact setup but is better than no data. Recommended: start with an existing dataset to validate the pipeline, then collect your own data with a gamepad or second arm.
Step 4: Data Quality Checklist
Bad data is worse than no data. Check every episode before keeping it.
import numpy as np, cv2
from pathlib import Path

base = Path('./demo_data/banana-pick')
for ep in sorted(base.glob('episode_*')):
    states = np.load(ep / 'states.npy')
    cap = cv2.VideoCapture(str(ep / 'front.mp4'))
    ret, frame = cap.read()
    cap.release()
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY) if ret else None
    sharpness = cv2.Laplacian(gray, cv2.CV_64F).var() if ret else 0
    brightness = gray.mean() if ret else 0
    motion = np.abs(states[-1] - states[0]).sum()
    ok = motion > 40 and sharpness > 150 and brightness > 30
    status = "OK" if ok else "BAD"
    print(f'{ep.name}: motion={motion:.0f} sharpness={sharpness:.0f} [{status}]')
Quality Thresholds
| Metric | Minimum | Notes |
|---|---|---|
| Sharpness (Laplacian variance) | > 150 | < 50 = camera covered or blurred |
| Motion (total joint degrees) | > 40 | < 20 = arm barely moved |
| Brightness | > 30 | < 30 = too dark, model can’t see |
| Frames per episode | ~300 (10s @ 30fps) | Fewer = incomplete demo |
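The table’s cutoffs can be collected into one helper so the loop above and any later tooling agree on the numbers. These values are this article’s empirical picks for one setup, not anything canonical, and the 10% frame-count slack is my own assumption:

```python
# Empirical quality gates from the table above — tune for your own setup.
THRESHOLDS = {"sharpness": 150.0, "motion": 40.0, "brightness": 30.0, "frames": 300}

def episode_ok(sharpness: float, motion: float, brightness: float, frames: int) -> bool:
    """True if an episode clears every minimum from the checklist."""
    return (sharpness > THRESHOLDS["sharpness"]
            and motion > THRESHOLDS["motion"]
            and brightness > THRESHOLDS["brightness"]
            and frames >= THRESHOLDS["frames"] * 0.9)  # allow slight frame drop

print(episode_ok(200, 55, 80, 300))  # → True
print(episode_ok(3, 55, 80, 300))    # → False (camera covered or blurred)
```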
Common Reasons for Bad Episodes
| Problem | Symptom | Fix |
|---|---|---|
| Camera moved during recording | Sharpness drops mid-episode | Mount camera with tape/clamp |
| Camera facing wrong direction | Low sharpness, no arm visible | Reposition before every session |
| Arm barely moved | Low motion score | Redo — complete full pick-and-place |
| MJPEG corruption (WSL2) | Blue block artifacts, sharpness = 3 | Use single camera or native Linux |
| Too dark | Low brightness | Add a desk lamp above workspace |
Step 5: Camera Positioning
The most underrated part of data collection. The model learns entirely from what the camera sees — if the camera moves between episodes, or doesn’t show the object and arm clearly, the model learns noise.
- Fix the camera — tape it, clamp it, don’t touch it between episodes
- Show the whole task — banana, plate, and robot arm must all be visible
- Consistent position — same angle every session, every day
- Good lighting — natural or desk lamp, avoid shadows on the workspace
- No background clutter — plain table surface is better than a messy desk
Step 6: Recording with LeRobot
If you have a leader arm or gamepad, the standard LeRobot recording command:
conda activate dum-e
python -m lerobot.record \
--robot.type=so101_follower \
--robot.port=/dev/ttyACM0 \
--robot.id=my_awesome_follower_arm \
--robot.cameras="{ front: {type: opencv, index_or_path: 0,
width: 640, height: 480, fps: 30}}" \
--dataset.repo_id=YOUR_HF_USERNAME/banana-pick \
--dataset.single_task="Pick up the banana and put it on the plate." \
--dataset.num_episodes=50
Tips for good demonstrations:
- Aim for 50-100 episodes — more is better
- Keep the camera position identical every session
- Vary the banana position slightly across episodes so the model generalizes
- Slow, smooth motions — jerky demos teach jerky behavior
- Each episode: start from rest, pick banana, place on plate, stop
Step 7: Fine-Tune GR00T
Once you have 50+ good episodes:
# 1. Copy the modality config
cp Isaac-GR00T/examples/so100__modality.json \
./demo_data/banana-pick/meta/modality.json
# 2. Fine-tune (in gr00t conda env)
conda activate gr00t
cd Isaac-GR00T
python scripts/gr00t_finetune.py \
--dataset-path ./demo_data/banana-pick \
--num-gpus 1 \
--batch-size 32 \
--output-dir ~/banana-checkpoint \
--max-steps 10000 \
--data-config so100 \
--video-backend torchvision_av
# 3. Start policy server with your checkpoint
python scripts/inference_service.py \
--server \
--model_path ~/banana-checkpoint \
--embodiment_tag new_embodiment \
--data_config so100
# 4. Run the robot (in dum-e conda env)
conda activate dum-e
python -m embodiment.so_arm10x.controller \
--config my-dum-e.yaml \
--instruction "Pick up the banana and put it on the plate."
Reduce --batch-size if you run out of VRAM (try 16 or 8). 10k steps takes ~2-4 hours on an RTX 3060.
Lessons Learned
- WSL2 + two cameras = trouble — use one camera or boot Linux natively
- Camera indices change after every usbipd re-attach — always check before recording
- MJPEG fixes WSL2 camera timeouts — set fourcc before the first read
- Serialization must match — server and client must use the same format (msgpack vs torch.save)
- Keyboard teleoperation doesn’t work with joint-space robots — use kinesthetic teaching or a gamepad
- Mount the camera securely — a moving camera ruins every episode
- Quality over quantity — 30 perfect demos beat 100 bad ones
- Check data before training — automated quality scripts save hours of wasted GPU time
- 10 episodes is not enough — aim for 50-100 for pick-and-place tasks
- The model needs YOUR data — a fine-tuned model from someone else’s robot won’t work on yours without retraining
Resources
- Isaac-GR00T — GR00T model and fine-tuning scripts
- Dum-E — Full voice+vision robot agent built on GR00T
- LeRobot — Robot learning framework (data collection, training, data format)
- SO-101 Assembly Guide — Hardware setup
- usbipd-win — USB passthrough for WSL2