From Webcam to Robot Brain: How Vision-Language Models and Vision-Language-Action Models Actually Work
9th March 2026
I built a webcam app that sends live frames to Claude and GPT-4o for real-time scene understanding. Along the way, I discovered how fundamentally different this is from what robots like OpenVLA do with the same camera input. Here’s the full pipeline — from photons hitting your webcam sensor to tokens coming back from the cloud.
The Complete Pipeline: Webcam → LLM → Response
When you press SPACE in the app, here’s exactly what happens across 11 steps:
| Step | What Happens | Where | Time | Data Size |
|---|---|---|---|---|
| 1 | Light hits sensor | Camera chip | 0.001ms | photons |
| 2 | Sensor → digital pixels | Camera ADC | 0.5ms | 921,600 ints |
| 3 | USB transfer to RAM | USB cable | 1ms | ~900 KB |
| 4 | JPEG compress | Your CPU | 2ms | ~70 KB |
| 5 | Base64 encode | Your CPU | 0.1ms | ~93 KB |
| 6 | Build JSON + HTTPS | Your CPU | 0.1ms | ~95 KB |
| 7 | Travel to Anthropic | Internet | 20-50ms | ~95 KB |
| 8 | Vision encoder (ViT) | GPU cluster | 50ms | → embeddings |
| 9 | Transformer (~80 layers) | GPU cluster | 500-2000ms | → tokens |
| 10 | Travel back to you | Internet | 20-50ms | ~1 KB |
| 11 | Print to terminal | Your CPU | 0.001ms | text |
Step 1-3: Camera Hardware (Physical → Digital)
Light enters the lens, hits the CMOS sensor — a silicon chip with millions of tiny photosites arranged in a Bayer filter pattern:
┌──┬──┬──┬──┬──┬──┐
│R │G │R │G │R │G │ ← Bayer color filter
├──┼──┼──┼──┼──┼──┤
│G │B │G │B │G │B │ Each photosite:
├──┼──┼──┼──┼──┼──┤ photon hits → electron freed
│R │G │R │G │R │G │ → voltage measured → 0 to 255
└──┴──┴──┴──┴──┴──┘
The ADC (Analog-to-Digital Converter) converts voltage (0.0V-3.3V) to integers (0-255). The result travels over USB into your Mac’s RAM.
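The voltage-to-integer step can be sketched in a few lines. This is an idealised linear model over the 0.0V-3.3V range quoted above; a real camera ADC also applies analog gain and other corrections:

```python
def adc(voltage, v_ref=3.3, bits=8):
    """Idealised linear ADC: map 0..v_ref volts to 0..2^bits - 1."""
    levels = (1 << bits) - 1              # 255 for 8 bits
    code = round(voltage / v_ref * levels)
    return max(0, min(levels, code))      # clamp out-of-range readings

print(adc(0.0))  # 0   (photosite saw no light)
print(adc(3.3))  # 255 (photosite saturated)
print(adc(1.0))  # 77  (somewhere in between)
```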
Step 4: OpenCV Reads One Frame
cap.read() gives you a NumPy array. Here’s what a 4×4 crop of a face looks like:
Row 0: [178,143,120] [180,145,122] [175,140,118] [170,135,115]
Row 1: [160,125,100] [ 45, 30, 25] [ 42, 28, 22] [155,120, 95]
Row 2: [158,123, 98] [ 40, 25, 20] [ 38, 23, 18] [150,115, 90]
Row 3: [190,155,130] [185,150,125] [188,153,128] [192,157,132]
————— ————— —————
skin tone dark (eye) skin tone
Total: 480 × 640 × 3 = 921,600 integers in RAM (~900 KB)
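You can verify those numbers without a camera: an OpenCV frame is just a NumPy `uint8` array, so a blank 480×640 BGR frame occupies exactly the same memory as a real one (this sketch only assumes NumPy; `cap.read()` returns an array of the same shape and dtype):

```python
import numpy as np

# Simulate one 480x640 BGR frame -- same shape/dtype as cap.read() returns
frame = np.zeros((480, 640, 3), dtype=np.uint8)

print(frame.shape)   # (480, 640, 3)
print(frame.nbytes)  # 921600 bytes, i.e. ~900 KB
print(frame[0, 0])   # one pixel: three values (B, G, R), each 0-255
```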
Step 5-6: Compress and Encode for HTTP
cv2.imencode('.jpg', frame) compresses 900KB down to ~70KB using JPEG’s DCT (Discrete Cosine Transform). Then base64.b64encode() converts binary bytes to ASCII text so it can travel inside JSON:
Raw pixels: JPEG bytes: Base64 text:
┌───────────┐ ┌──────────┐ ┌──────────────────────────────────┐
│ 921,600 │ │ ~70,000 │ │ "/9j/4AAQSkZJRgABAQEAYABg..." │
│ integers │ │ bytes │ │ ~93,000 characters │
└───────────┘ └──────────┘ └──────────────────────────────────┘
13x smaller 33% larger than JPEG, but HTTP-safe
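The "33% larger" figure falls straight out of how base64 works: every 3 input bytes become 4 output characters. A quick stdlib check on a 70,000-byte dummy payload:

```python
import base64
import math

jpeg_bytes = b"\xff" * 70_000       # stand-in for a ~70 KB JPEG
b64 = base64.b64encode(jpeg_bytes)

print(len(b64))                     # 93336 characters
print(4 * math.ceil(70_000 / 3))    # 93336 -- same number, from the 3-to-4 rule
print(len(b64) / len(jpeg_bytes))   # ~1.33x expansion
```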
Step 7: The API Call
This is what actually goes to Claude:
{
"model": "claude-sonnet-4-20250514",
"max_tokens": 500,
"messages": [{
"role": "user",
"content": [
{
"type": "image",
"source": {
"type": "base64",
"media_type": "image/jpeg",
"data": "/9j/4AAQ..." ← ~93KB of your webcam frame
}
},
{
"type": "text",
"text": "What do you see?" ← your question
}
]
}]
}
This travels: Your Mac → WiFi → ISP → Internet → Anthropic servers.
Step 8-9: Inside the LLM
The image goes through a Vision Transformer (ViT) that splits it into 14×14 pixel patches, each converted into an embedding vector:
┌────┬────┬────┬────┐
│ P1 │ P2 │ P3 │ P4 │ Each patch = 14×14 pixels
├────┼────┼────┼────┤ Each patch → vector of ~1536 floats
│ P5 │ P6 │ P7 │ P8 │ [0.23, -0.87, 1.42, 0.05, ...]
├────┼────┼────┼────┤
│ P9 │P10 │P11 │P12 │ Similar objects → similar vectors
├────┼────┼────┼────┤
│P13 │P14 │P15 │P16 │
└────┴────┴────┴────┘
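The patch arithmetic is easy to check. Claude's actual encoder resolution and embedding width are not public, so this sketch assumes a hypothetical 224×224 input (a common ViT default) with the 14×14 patch size and ~1536-float embeddings shown above:

```python
# Standard ViT patching arithmetic (illustrative -- the real encoder's
# input resolution and embedding width are assumptions, not published specs)
image_size = 224    # hypothetical input resolution
patch_size = 14     # patch width/height from the diagram above
embed_dim = 1536    # floats per patch embedding, as quoted above

patches_per_side = image_size // patch_size  # 16 patches across
num_patches = patches_per_side ** 2          # 256 image tokens total

print(num_patches)                # 256
print((num_patches, embed_dim))   # (256, 1536) -- the patch-embedding matrix
```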
The Transformer then processes these image tokens alongside text tokens through ~80 layers of self-attention:
Layer 1: raw features (edges, colors)
Layer 10: parts (eyes, screens, hands)
Layer 30: objects (person, laptop, desk)
Layer 60: relationships (person AT desk)
Layer 80: semantic understanding
Output: probability of next word
"I" → "see" → "a" → "person" → "sitting"
Each word chosen one at a time (autoregressive)
The Key Insight: It’s a Photo, Not Video
The webcam runs at 30fps, but only ONE frame gets sent when you press SPACE. The other 999 frames are displayed on screen and thrown away:
WHAT YOU SEE WHAT ACTUALLY HAPPENS
————————— —————————————
Webcam shows live video cap.read() runs 30x/sec
(smooth motion) but frames are just DISPLAYED
│
You press SPACE
│
▼
┌───────────┐ ┌───────────┐
│ Frame 847 │ ← only │ JPEG │
│ (1 image) │ THIS one │ encode │
└───────────┘ gets sent │ base64 │
└─────┬─────┘
│
▼
Claude API
(sees 1 photo)
│
▼
"I see a person
with glasses..."
999 other frames THROWN AWAY
were never sent
The LLM has no concept of motion, what happened 1 second ago, or what will happen next. It’s a single-frame snapshot analysis.
Three Levels of Vision Intelligence
| Level | Approach | Input | Capability |
|---|---|---|---|
| Level 1 | Single frame + VLM | 1 photo | “What do you see?” → snapshot analysis |
| Level 2 | Multi-frame sequence | 5-10 photos | “What is happening?” → understands motion/activity |
| Level 3 | Continuous stream | Real-time frames + voice | “Talk to me about what you see” → OpenAI Realtime API, Gemini Live |
Our webcam app operates at Level 1. Each press of SPACE is an independent question about an independent photo.
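Moving to Level 2 needs no new model, just a different payload: the Anthropic Messages API accepts multiple image blocks in a single message, so you can send a short burst of frames and ask what changed between them. A sketch of building that payload (no network call here; `frame_b64_list` is assumed to hold base64 JPEG strings captured a fraction of a second apart):

```python
def build_sequence_message(frame_b64_list, question="What is happening?"):
    """Build a multi-image user message: several frames plus one question."""
    content = [
        {"type": "image",
         "source": {"type": "base64",
                    "media_type": "image/jpeg",
                    "data": b64}}
        for b64 in frame_b64_list
    ]
    content.append({"type": "text", "text": question})
    return [{"role": "user", "content": content}]

# Three captured frames -> one question about motion across them
msgs = build_sequence_message(["frame1...", "frame2...", "frame3..."])
print(len(msgs[0]["content"]))  # 4 blocks: 3 images + 1 text
```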
VLM vs VLA: The Fundamental Difference
This is where it gets interesting. What we built is a Vision-Language Model (VLM). OpenVLA is a Vision-Language-Action model (VLA). Same camera input, radically different outputs:
VLM (Claude / GPT-4o) — Our Webcam App
————————————————————
┌────────┐ ┌───────────┐ ┌──────────────┐
│ Webcam │ → │ Claude / │ → │ "I see a │
│ frame │ │ GPT-4o │ │ person at │
└────────┘ └───────────┘ │ a desk" │
└──────────────┘
OUTPUT = WORDS (text description)
Can TALK about what it sees, but CANNOT act on it.
VLA (OpenVLA) — Robot Control
———————————————
┌────────┐ ┌─────────────┐
│ Camera │ │ "Pick up │
│ frame │ │ the red │
└────┬───┘ │ block" │
│ └──────┬──────┘
│ │
└──────┬───────┘
│
▼
┌─────────────┐
│ OpenVLA │
│ (7B params)│
│ │
│ Fused by LLM│
└──────┬──────┘
│
▼
┌──────────────────────┐
│ [+0.02, -0.01, │
│ +0.05, grip: close,│
│ rotate: 15°] │
└──────────┬───────────┘
│
▼
┌──────────────────────┐
│ ROBOT ARM MOVES │
│ picks up block! │
└──────────────────────┘
OUTPUT = MOTOR COMMANDS (x,y,z coordinates, gripper open/close)
Can SEE, UNDERSTAND, and PHYSICALLY ACT.
OpenVLA Architecture (7 Billion Parameters)
OpenVLA uses a clever architecture that repurposes a language model to output robot actions instead of words:
┌──────────────────────────────────────────────────────────────┐
│ OpenVLA (7B parameters) │
│ │
│ ┌─────────────────────────────────────┐ │
│ │ DUAL VISION ENCODER │ │
│ │ ┌─────────┐ ┌─────────┐ │ │
│ │ │ SigLIP │ │ DinoV2 │ │ │
│ │ │ "what │ │ "where │ │ │
│ │ │ is it?" │ │ is it?" │ │ │
│ │ └────┬────┘ └────┬────┘ │ │
│ │ └─────┬─────┘ │ │
│ │ ▼ │ │
│ │ [image tokens] │ │
│ └─────────────┬──────────────────────┘ │
│ ▼ │
│ ┌───────────────────────────────┐ │
│ │ PROJECTOR (adapter layer) │ │
│ │ Maps image features into │ │
│ │ language model’s space │ │
│ └───────────────┬───────────────┘ │
│ ▼ │
│ ┌───────────────────────────────────────┐ │
│ │ LLAMA 2 (7B Language Model) │ │
│ │ │ │
│ │ Input: [image tokens] + "pick up │ │
│ │ the red block" │ │
│ │ │ │
│ │ Output: [action_token_1] │ │
│ │ [action_token_2] │ │
│ │ [action_token_3] ... │ │
│ │ │ │ │
│ │ ▼ │ │
│ │ DE-TOKENIZE │ │
│ │ tokens → continuous numbers │ │
│ │ → [0.02, -0.01, 0.05, 0.8] │ │
│ └───────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ROBOT ARM EXECUTES MOVEMENT │
└──────────────────────────────────────────────────────────────┘
The key trick: instead of outputting word tokens like “the” or “cat”, the LLM outputs action tokens that get de-tokenized into continuous motor commands (position deltas, gripper state, rotation angles).
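De-tokenization itself is simple: each action dimension is discretized into 256 bins, and a token is just a bin index mapped back to the bin's centre. This sketch assumes uniform bins over [-1, 1] for clarity; OpenVLA actually derives its bin edges from the training-data action distribution:

```python
NUM_BINS = 256

def detokenize(bin_index, low=-1.0, high=1.0):
    """Map a discrete action token (bin index) back to a continuous value.

    Simplified: uniform bins over [low, high]. The real model uses
    data-dependent bin edges rather than a fixed uniform grid.
    """
    width = (high - low) / NUM_BINS
    return low + (bin_index + 0.5) * width  # centre of the bin

# One action token per degree of freedom: [dx, dy, dz, gripper]
tokens = [130, 126, 134, 230]
action = [round(detokenize(t), 4) for t in tokens]
print(action)  # [0.0195, -0.0117, 0.0508, 0.8008] -- close to the example above
```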
The AI Vision Evolution Ladder
| Level | Model Type | Sees? | Talks? | Moves? | Example |
|---|---|---|---|---|---|
| 1 | CNN / YOLO | ✓ | ✗ | ✗ | “box around dog” |
| 2 | VLM | ✓ | ✓ | ✗ | Claude, GPT-4o (our app) |
| 3 | VLA | ✓ | ✓ | ✓ | OpenVLA, RT-2 |
| 4 | World Model | ✓ | ✓ | ✓ + predicts future | Tesla FSD, 1X |
Our webcam app sits at Level 2. It can see and describe, but it cannot act. OpenVLA sits at Level 3 — it closes the loop from perception to physical action.
Side-by-Side: Same Input, Different Outputs
| System | Input | Output | Use Case |
|---|---|---|---|
| VLM (Claude/GPT) | image + question | TEXT: “I see a red cup” | Description, analysis, QA |
| YOLO | image | BOXES: [x1,y1,x2,y2, “cup”] | Object detection, counting |
| VLA (OpenVLA) | image + instruction | MOTOR: [+0.02, -0.01, grip_close] | Robot manipulation |
| Tesla FSD | 8 cameras + radar | STEERING: [angle, accel, brake] | Autonomous driving |
The Code: Surprisingly Simple
The entire webcam-to-Claude pipeline is under 80 lines of Python. The core loop:
import cv2, base64, anthropic

cap = cv2.VideoCapture(0)
client = anthropic.Anthropic()

while True:
    ret, frame = cap.read()
    if not ret:          # camera unplugged or read failed
        break
    cv2.imshow("LIVE", frame)
    key = cv2.waitKey(1)
    if key == ord('q'):  # quit
        break
    if key == ord(' '):  # SPACE pressed
        # Compress frame to JPEG
        _, buffer = cv2.imencode('.jpg', frame)
        # Encode to base64 for HTTP
        img_b64 = base64.b64encode(buffer).decode('utf-8')
        # Send to Claude
        response = client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=500,
            messages=[{
                "role": "user",
                "content": [
                    {"type": "image", "source": {
                        "type": "base64",
                        "media_type": "image/jpeg",
                        "data": img_b64
                    }},
                    {"type": "text", "text": "What do you see?"}
                ]
            }]
        )
        print(response.content[0].text)

cap.release()
cv2.destroyAllWindows()
That’s it. OpenCV captures the frame, JPEG compresses it, base64 encodes it, and the Anthropic SDK sends it as a multimodal message. The entire intelligence lives in the API — your code is just plumbing.
What Claude Actually Said
Here’s a real response from the live demo:
“I can see a person sitting in what appears to be an indoor setting. The person is wearing clear-framed glasses (appearing to be aviator-style) and a red/burgundy colored shirt or sweater. Behind them, I can see white walls, a window with white horizontal blinds that are partially closed, allowing some natural light to filter through, and a framed piece of wall art that appears to contain some inspirational text.”
From a single ~70 KB JPEG, Claude identified: the person, their glasses style, shirt color, room layout, window type, blinds position, wall art, and lighting conditions. All from one frozen frame.
Where This Goes Next
The gap between Level 2 (VLM) and Level 3 (VLA) is where the most exciting work is happening. To bridge it, you need:
- A robot arm — something like the SO-100 ($300-500)
- A fine-tuned VLA — OpenVLA trained on 50+ demonstrations of your specific task
- A control loop — camera → model → action → camera → model → action, running at 5-10Hz
The webcam app we built is the first rung on this ladder. Same camera, same Python, same API pattern — but the output changes from words to physical movement.
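That closed loop can be sketched as a plain Python loop. Here `capture_frame`, `predict_action`, and `execute` are hypothetical stand-ins for the camera read, the VLA forward pass, and the arm's SDK; the sleep pads each cycle out to the target rate:

```python
import time

HZ = 5                  # target control rate (5-10 Hz is typical for VLAs)
PERIOD = 1.0 / HZ

def control_loop(capture_frame, predict_action, execute, steps=10):
    """camera -> model -> action -> repeat, at a fixed rate.

    The three callables are hypothetical: swap in cv2.VideoCapture,
    an OpenVLA policy, and your robot arm's SDK.
    """
    for _ in range(steps):
        start = time.monotonic()
        frame = capture_frame()
        action = predict_action(frame, "pick up the red block")
        execute(action)
        # Sleep off whatever time is left in this control period
        elapsed = time.monotonic() - start
        time.sleep(max(0.0, PERIOD - elapsed))

# Dry run with stubs: no camera, no robot
log = []
control_loop(lambda: "frame",
             lambda f, instr: [0.02, -0.01, 0.05, 0.8],
             log.append,
             steps=3)
print(len(log))  # 3 actions executed
```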