Akshay Parkhi's Weblog


From Webcam to Robot Brain: How Vision-Language Models and Vision-Language-Action Models Actually Work

9th March 2026

I built a webcam app that sends live frames to Claude and GPT-4o for real-time scene understanding. Along the way, I discovered how fundamentally different this is from what robot models like OpenVLA do with the same camera input. Here’s the full pipeline — from photons hitting your webcam sensor to tokens coming back from the cloud.

The Complete Pipeline: Webcam → LLM → Response

When you press SPACE in the app, here’s exactly what happens across 11 steps:

Step    What Happens              Where           Time        Data Size
—————————————————————————————————————
1. Light hits sensor        Camera chip       0.001ms     photons
2. Sensor → digital pixels  Camera ADC        0.5ms       921,600 ints
3. USB transfer to RAM      USB cable         1ms         ~900 KB
4. JPEG compress            Your CPU          2ms         ~70 KB
5. Base64 encode            Your CPU          0.1ms       ~93 KB
6. Build JSON + HTTPS       Your CPU          0.1ms       ~95 KB
7. Travel to Anthropic      Internet          20-50ms     ~95 KB
8. Vision encoder (ViT)     GPU cluster       50ms        → embeddings
9. Transformer (80 layers)  GPU cluster       500-2000ms  → tokens
10. Travel back to you      Internet          20-50ms     ~1 KB
11. Print to terminal       Your CPU          0.001ms     text

Step 1-3: Camera Hardware (Physical → Digital)

Light enters the lens, hits the CMOS sensor — a silicon chip with millions of tiny photosites arranged in a Bayer filter pattern:

┌──┬──┬──┬──┬──┬──┐
│R │G │R │G │R │G │  ← Bayer color filter
├──┼──┼──┼──┼──┼──┤
│G │B │G │B │G │B │    Each photosite:
├──┼──┼──┼──┼──┼──┤    photon hits → electron freed
│R │G │R │G │R │G │    → voltage measured → 0 to 255
└──┴──┴──┴──┴──┴──┘

The ADC (Analog-to-Digital Converter) converts voltage (0.0V-3.3V) to integers (0-255). The result travels over USB into your Mac’s RAM.
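The ADC's mapping can be sketched as a simple linear quantization. This is a toy model — real sensors also apply gain, gamma correction, and demosaicing — but the core idea is just voltage-to-integer scaling:

```python
def adc_quantize(voltage, v_max=3.3, bits=8):
    """Map an analog voltage (0.0 to v_max) to an integer code (0 to 2^bits - 1)."""
    levels = (1 << bits) - 1              # 255 for an 8-bit ADC
    code = round(voltage / v_max * levels)
    return max(0, min(levels, code))      # clamp to the valid range

print(adc_quantize(0.0))   # darkest photosite -> 0
print(adc_quantize(3.3))   # fully saturated   -> 255
print(adc_quantize(1.65))  # half brightness
```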

Step 4: OpenCV Reads One Frame

cap.read() gives you a NumPy array. Here’s what a 4×4 crop of a face looks like:

Row 0: [178,143,120] [180,145,122] [175,140,118] [170,135,115]
Row 1: [160,125,100] [ 45, 30, 25] [ 42, 28, 22] [155,120, 95]
Row 2: [158,123, 98] [ 40, 25, 20] [ 38, 23, 18] [150,115, 90]
Row 3: [190,155,130] [185,150,125] [188,153,128] [192,157,132]
        —————    —————    —————
        skin tone      dark (eye)    skin tone

Total: 480 × 640 × 3 = 921,600 integers in RAM (~900 KB)
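The memory math is easy to verify yourself — pure arithmetic, no camera needed:

```python
height, width, channels = 480, 640, 3  # default webcam frame; OpenCV stores it BGR

pixels = height * width        # 307,200 pixels
values = pixels * channels     # one uint8 integer per color channel
kilobytes = values / 1024

print(values)                  # 921600 integers
print(f"{kilobytes:.0f} KB")   # 900 KB
```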

Step 5-6: Compress and Encode for HTTP

cv2.imencode('.jpg', frame) compresses 900KB down to ~70KB using JPEG’s DCT (Discrete Cosine Transform). Then base64.b64encode() converts binary bytes to ASCII text so it can travel inside JSON:

Raw pixels:     JPEG bytes:      Base64 text:
┌───────────┐   ┌──────────┐     ┌──────────────────────────────────┐
│ 921,600   │   │ ~70,000  │     │ "/9j/4AAQSkZJRgABAQEAYABg..."   │
│ integers  │   │ bytes    │     │ ~93,000 characters               │
└───────────┘   └──────────┘     └──────────────────────────────────┘
                  13x smaller      33% larger than JPEG, but HTTP-safe
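You can verify the base64 overhead with nothing but the standard library, using random bytes as a stand-in for the JPEG data:

```python
import base64, os

jpeg_stand_in = os.urandom(70_000)        # pretend this is a ~70 KB JPEG
encoded = base64.b64encode(jpeg_stand_in)

print(len(encoded))                        # 93336 characters (~93 KB)
print(len(encoded) / len(jpeg_stand_in))   # ~1.33: 4 output chars per 3 input bytes
```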

Step 7: The API Call

This is what actually goes to Claude:

{
  "model": "claude-sonnet-4-20250514",
  "max_tokens": 500,
  "messages": [{
    "role": "user",
    "content": [
      {
        "type": "image",
        "source": {
          "type": "base64",
          "media_type": "image/jpeg",
          "data": "/9j/4AAQ..."       ← ~93KB of your webcam frame
        }
      },
      {
        "type": "text",
        "text": "What do you see?"    ← your question
      }
    ]
  }]
}

This travels: Your Mac → WiFi → ISP → Internet → Anthropic servers.
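If you skipped the SDK, the same request could be assembled by hand. A sketch — the `x-api-key` and `anthropic-version` headers are what the Messages API expects, the key and image data are placeholders, and no network call is made here:

```python
import json

api_key = "sk-ant-..."    # placeholder API key
img_b64 = "/9j/4AAQ..."   # truncated base64 from the previous step

headers = {
    "x-api-key": api_key,
    "anthropic-version": "2023-06-01",
    "content-type": "application/json",
}

payload = {
    "model": "claude-sonnet-4-20250514",
    "max_tokens": 500,
    "messages": [{
        "role": "user",
        "content": [
            {"type": "image", "source": {
                "type": "base64", "media_type": "image/jpeg", "data": img_b64}},
            {"type": "text", "text": "What do you see?"},
        ],
    }],
}

body = json.dumps(payload)  # this string is what travels over HTTPS
print(type(body).__name__)  # str
```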

Step 8-9: Inside the LLM

The image goes through a Vision Transformer (ViT) that splits it into 14×14 pixel patches, each converted into an embedding vector:

┌────┬────┬────┬────┐
│ P1 │ P2 │ P3 │ P4 │  Each patch = 14×14 pixels
├────┼────┼────┼────┤  Each patch → vector of ~1536 floats
│ P5 │ P6 │ P7 │ P8 │  [0.23, -0.87, 1.42, 0.05, ...]
├────┼────┼────┼────┤
│ P9 │P10 │P11 │P12 │  Similar objects → similar vectors
├────┼────┼────┼────┤
│P13 │P14 │P15 │P16 │
└────┴────┴────┴────┘

The Transformer then processes these image tokens alongside text tokens through ~80 layers of self-attention:

Layer 1:   raw features (edges, colors)
Layer 10:  parts (eyes, screens, hands)
Layer 30:  objects (person, laptop, desk)
Layer 60:  relationships (person AT desk)
Layer 80:  semantic understanding

Output: probability of next word
"I" → "see" → "a" → "person" → "sitting"
Each word chosen one at a time (autoregressive)
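The patch math is easy to check. Assuming the encoder resizes frames to 224×224 before patching (a common ViT input size — the exact resize inside Claude's encoder isn't public):

```python
image_size = 224    # assumed ViT input resolution
patch_size = 14     # 14x14 pixel patches, as above

patches_per_side = image_size // patch_size   # 16
num_patches = patches_per_side ** 2           # 256 image tokens

print(num_patches)  # 256 patches, each mapped to an embedding of ~1536 floats
```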

The Key Insight: It’s a Photo, Not Video

The webcam runs at 30fps, but only ONE frame gets sent when you press SPACE. The other 999 frames are displayed on screen and thrown away:

WHAT YOU SEE                    WHAT ACTUALLY HAPPENS
—————————               —————————————
Webcam shows live video         cap.read() runs 30x/sec
    (smooth motion)             but frames are just DISPLAYED
           │
     You press SPACE
           │
           ▼
     ┌───────────┐              ┌───────────┐
     │ Frame 847 │  ← only      │  JPEG     │
     │ (1 image) │  THIS one    │  encode   │
     └───────────┘  gets sent   │  base64   │
                                └─────┬─────┘
                                      │
                                      ▼
                                 Claude API
                                 (sees 1 photo)
                                      │
                                      ▼
                                 "I see a person
                                  with glasses..."

     999 other frames           THROWN AWAY
     were never sent

The LLM has no concept of motion, what happened 1 second ago, or what will happen next. It’s a single-frame snapshot analysis.

Three Levels of Vision Intelligence

Level     Approach                Input                       Capability
————————————————————————————————————————————————————————————————————
Level 1   Single frame + VLM      1 photo                     “What do you see?” → snapshot analysis
Level 2   Multi-frame sequence    5-10 photos                 “What is happening?” → understands motion/activity
Level 3   Continuous stream       Real-time frames + voice    “Talk to me about what you see” → OpenAI Realtime API, Gemini Live

Our webcam app operates at Level 1. Each press of SPACE is an independent question about an independent photo.
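Upgrading to Level 2 is mostly a payload change: pack several frames into one message instead of one. A sketch following the Anthropic Messages API message shape — `frames_b64` is a hypothetical list of base64-encoded JPEG strings:

```python
def build_multiframe_message(frames_b64, question="What is happening?"):
    """Pack several base64-encoded JPEG frames plus one question into a user message."""
    content = [
        {"type": "image", "source": {
            "type": "base64", "media_type": "image/jpeg", "data": b64}}
        for b64 in frames_b64
    ]
    content.append({"type": "text", "text": question})
    return {"role": "user", "content": content}

msg = build_multiframe_message(["frame1...", "frame2...", "frame3..."])
print(len(msg["content"]))  # 3 image blocks + 1 text block = 4
```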

VLM vs VLA: The Fundamental Difference

This is where it gets interesting. What we built is a Vision-Language Model (VLM). OpenVLA is a Vision-Language-Action model (VLA). Same camera input, radically different outputs:

VLM (Claude / GPT-4o) — Our Webcam App
————————————————————
┌────────┐      ┌───────────┐      ┌──────────────┐
│ Webcam │  →  │ Claude /  │  →  │ "I see a     │
│ frame  │      │ GPT-4o    │      │  person at   │
└────────┘      └───────────┘      │  a desk"     │
                                   └──────────────┘
OUTPUT = WORDS (text description)
Can TALK about what it sees, but CANNOT act on it.


VLA (OpenVLA) — Robot Control
———————————————

  ┌────────┐   ┌─────────────┐
  │ Camera │   │ "Pick up    │
  │ frame  │   │  the red    │
  └────┬───┘   │  block"     │
       │       └──────┬──────┘
       │              │
       └──────┬───────┘
              │
              ▼
       ┌─────────────┐
       │   OpenVLA   │
       │  (7B params)│
       │             │
       │ Fused by LLM│
       └──────┬──────┘
              │
              ▼
  ┌──────────────────────┐
  │ [+0.02, -0.01,      │
  │  +0.05, grip: close,│
  │  rotate: 15°]        │
  └──────────┬───────────┘
             │
             ▼
  ┌──────────────────────┐
  │   ROBOT ARM MOVES    │
  │   picks up block!    │
  └──────────────────────┘

OUTPUT = MOTOR COMMANDS (x,y,z coordinates, gripper open/close)
Can SEE, UNDERSTAND, and PHYSICALLY ACT.

OpenVLA Architecture (7 Billion Parameters)

OpenVLA uses a clever architecture that repurposes a language model to output robot actions instead of words:

┌──────────────────────────────────────────────────────────────┐
│  OpenVLA (7B parameters)                                     │
│                                                              │
│  ┌─────────────────────────────────────┐                     │
│  │  DUAL VISION ENCODER                │                     │
│  │  ┌─────────┐ ┌─────────┐           │                     │
│  │  │ SigLIP  │ │ DinoV2  │           │                     │
│  │  │ "what   │ │ "where  │           │                     │
│  │  │ is it?" │ │ is it?" │           │                     │
│  │  └────┬────┘ └────┬────┘           │                     │
│  │       └─────┬─────┘                │                     │
│  │             ▼                      │                     │
│  │       [image tokens]               │                     │
│  └─────────────┬──────────────────────┘                     │
│                ▼                                             │
│  ┌───────────────────────────────┐                           │
│  │  PROJECTOR (adapter layer)    │                           │
│  │  Maps image features into     │                           │
│  │  language model’s space       │                           │
│  └───────────────┬───────────────┘                           │
│                  ▼                                           │
│  ┌───────────────────────────────────────┐                   │
│  │  LLAMA 2 (7B Language Model)          │                   │
│  │                                       │                   │
│  │  Input: [image tokens] + "pick up     │                   │
│  │          the red block"               │                   │
│  │                                       │                   │
│  │  Output: [action_token_1]             │                   │
│  │          [action_token_2]             │                   │
│  │          [action_token_3] ...         │                   │
│  │              │                        │                   │
│  │              ▼                        │                   │
│  │         DE-TOKENIZE                   │                   │
│  │         tokens → continuous numbers  │                   │
│  │         → [0.02, -0.01, 0.05, 0.8]  │                   │
│  └───────────────────────────────────────┘                   │
│              │                                               │
│              ▼                                               │
│         ROBOT ARM EXECUTES MOVEMENT                          │
└──────────────────────────────────────────────────────────────┘

The key trick: instead of outputting word tokens like “the” or “cat”, the LLM outputs action tokens that get de-tokenized into continuous motor commands (position deltas, gripper state, rotation angles).
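De-tokenization can be sketched as uniform binning. OpenVLA discretizes each action dimension into 256 bins; the sketch below assumes a simple bin-center mapping over a normalized [-1, 1] range (the real model calibrates per-dimension ranges from training data):

```python
NUM_BINS = 256

def detokenize(token_id, low=-1.0, high=1.0):
    """Map a discrete action token (0-255) back to a continuous value: its bin center."""
    bin_width = (high - low) / NUM_BINS
    return low + (token_id + 0.5) * bin_width

# One action = a few tokens, one per dimension (dx, dy, dz, gripper, ...)
action_tokens = [130, 126, 134, 230]
action = [round(detokenize(t), 3) for t in action_tokens]
print(action)  # [0.02, -0.012, 0.051, 0.801]
```

Tokens near 128 decode to small position deltas; tokens near the extremes decode to large ones.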

The AI Vision Evolution Ladder

Level   Model Type     Sees?                    Talks?   Moves?   Example
——————————————————————————————————————————————————————————————
1       CNN / YOLO     ✓                        ✗        ✗        “box around dog”
2       VLM            ✓                        ✓        ✗        Claude, GPT-4o (our app)
3       VLA            ✓                        ✓        ✓        OpenVLA, RT-2
4       World Model    ✓ + predicts future      ✓        ✓        Tesla FSD, 1X

Our webcam app sits at Level 2. It can see and describe, but it cannot act. OpenVLA sits at Level 3 — it closes the loop from perception to physical action.

Side-by-Side: Same Input, Different Outputs

System             Input                 Output                               Use Case
—————————————————————————————————————————————————————————————————
VLM (Claude/GPT)   image + question      TEXT: “I see a red cup”              Description, analysis, QA
YOLO               image                 BOXES: [x1,y1,x2,y2, “cup”]          Object detection, counting
VLA (OpenVLA)      image + instruction   MOTOR: [+0.02, -0.01, grip_close]    Robot manipulation
Tesla FSD          8 cameras + radar     STEERING: [angle, accel, brake]      Autonomous driving

The Code: Surprisingly Simple

The entire webcam-to-Claude pipeline is under 80 lines of Python. The core loop:

import cv2, base64, anthropic

cap = cv2.VideoCapture(0)
client = anthropic.Anthropic()

while True:
    ret, frame = cap.read()
    if not ret:
        break  # camera unplugged or read failed
    cv2.imshow("LIVE", frame)

    key = cv2.waitKey(1)
    if key == ord('q'):  # quit cleanly
        break
    if key == ord(' '):  # SPACE pressed
        # Compress frame to JPEG
        _, buffer = cv2.imencode('.jpg', frame)

        # Encode to base64 for HTTP
        img_b64 = base64.b64encode(buffer).decode('utf-8')

        # Send to Claude
        response = client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=500,
            messages=[{
                "role": "user",
                "content": [
                    {"type": "image", "source": {
                        "type": "base64",
                        "media_type": "image/jpeg",
                        "data": img_b64
                    }},
                    {"type": "text", "text": "What do you see?"}
                ]
            }]
        )
        print(response.content[0].text)

cap.release()
cv2.destroyAllWindows()

That’s it. OpenCV captures the frame, JPEG compresses it, base64 encodes it, and the Anthropic SDK sends it as a multimodal message. The entire intelligence lives in the API — your code is just plumbing.

What Claude Actually Said

Here’s a real response from the live demo:

“I can see a person sitting in what appears to be an indoor setting. The person is wearing clear-framed glasses (appearing to be aviator-style) and a red/burgundy colored shirt or sweater. Behind them, I can see white walls, a window with white horizontal blinds that are partially closed, allowing some natural light to filter through, and a framed piece of wall art that appears to contain some inspirational text.”

From a single 250KB JPEG, Claude identified: the person, their glasses style, shirt color, room layout, window type, blinds position, wall art, and lighting conditions. All from one frozen frame.

Where This Goes Next

The gap between Level 2 (VLM) and Level 3 (VLA) is where the most exciting work is happening. To bridge it, you need:

  • A robot arm — something like the SO-100 ($300-500)
  • A fine-tuned VLA — OpenVLA trained on 50+ demonstrations of your specific task
  • A control loop — camera → model → action → camera → model → action, running at 5-10Hz

The webcam app we built is the first rung on this ladder. Same camera, same Python, same API pattern — but the output changes from words to physical movement.
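The control loop in that last bullet can be sketched with stubs. Everything named here — `read_frame`, `predict_action`, `execute` — is a placeholder, not a real API; the point is the fixed-rate perceive→predict→act cycle:

```python
import time

def read_frame():                          # placeholder: grab a camera frame
    return "frame"

def predict_action(frame, instruction):   # placeholder: VLA forward pass
    return [0.02, -0.01, 0.05, 1.0]       # dx, dy, dz, gripper

def execute(action):                       # placeholder: send command to the arm
    pass

HZ = 5                                     # 5-10 Hz is typical for VLA control
actions_sent = 0
for _ in range(5):                         # five control ticks for the demo
    tick_start = time.monotonic()
    frame = read_frame()
    action = predict_action(frame, "pick up the red block")
    execute(action)
    actions_sent += 1
    # sleep out the rest of the tick to hold a steady rate
    time.sleep(max(0.0, 1.0 / HZ - (time.monotonic() - tick_start)))

print(actions_sent)  # 5
```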
