Akshay Parkhi's Weblog

The Complete Beginner’s Guide to Robotics — From Your First Camera to Foundation Models

10th March 2026

This is everything you need to go from zero to building, sensing, and coding robots at home. We’ll start with what hardware to buy, walk through how Tesla and Waymo actually see the road, write real depth estimation and PID control code you can run with just a camera, and end with how foundation models and RAG are transforming what robots can do.

Part 1: Building Your Home Robotics Lab

If you have a camera, you already have the most useful sensor in robotics. Here’s what else you need.

Controllers

| Controller | Price | Best For | Key Feature |
|---|---|---|---|
| Raspberry Pi 4/5 | $35–80 | All-around beginner | Runs Linux, ROS 2, GPIO pins |
| NVIDIA Jetson Nano/Orin Nano | $100–200 | AI and vision workloads | GPU for neural networks on camera input |
| Arduino Uno/Mega | $25–45 | Actuator control | No OS, pairs well with Raspberry Pi |

Actuators

| Type | Price | Use Case |
|---|---|---|
| SG90 Micro Servo | $2–5 | Small robot arms, precise 0°–180° angles |
| MG996R Servo | $8–12 | Larger joints, more torque |
| DC Motor + L298N Driver | ~$5 | Wheels and drivetrain |
| NEMA 17 Stepper | $10–15 | Precise position control (3D printer style) |

Starter Kits

| Kit | Price | What You Get |
|---|---|---|
| Elegoo Smart Robot Car | ~$50 | Wheeled robot with sensors, motors, controller |
| Freenove Big Hexapod Robot | ~$80 | 6-legged walking robot for Raspberry Pi |

What You Can Do Right Now With Just a Camera

Camera  →  Raspberry Pi / your laptop  →  ROS 2

# Install ROS 2 and test your camera immediately
sudo apt install ros-humble-usb-cam
ros2 run usb_cam usb_cam_node_exe

# View the stream
ros2 run rqt_image_view rqt_image_view

Even before hardware arrives, you can run object detection (YOLOv8), pose estimation, AprilTag tracking, and depth estimation — all from a laptop camera.

Recommended Learning Path

  1. Camera + laptop — Learn ROS 2, computer vision, object detection
  2. Add Raspberry Pi (~$50 total) — Learn GPIO, Linux, hardware communication
  3. Add servos + motors (~$30 more) — Build a robot that reacts to what it sees

Part 2: Sensors — How Robots See the World

Camera Types

| Type | How It Works | Best For |
|---|---|---|
| Monocular | Single lens, 2D images | Basic detection, image recognition, simple navigation |
| Stereo | Two lenses, mimics human binocular vision | Depth perception, 3D mapping, obstacle avoidance |
| RGB-D | RGB color + depth sensing combined | Advanced navigation, object manipulation (e.g. Intel RealSense) |
| Thermal | Detects infrared radiation / temperature | Dark environments, smoke, fog, search and rescue |
| Event-Based | Per-pixel brightness-change events; high temporal resolution, low latency | High-speed motion detection and tracking |

LIDAR Types

| Type | Characteristics | Use Case |
|---|---|---|
| 2D LIDAR | Simpler, cheaper, faster | Basic planar navigation, obstacle detection |
| 3D LIDAR | Detailed spatial info, wide field of view | Autonomous vehicle mapping |
| Solid-State / Flash / MEMS | Durable, cost-effective | Production vehicles, consumer robotics |

Other Common Sensors

| Sensor | What It Does |
|---|---|
| Ultrasonic | Sound waves measure distance (used in vacuum robots) |
| RADAR | Distance, speed, and size — reliable in all conditions |
| GPS | Outdoor localization |
| IMU | Acceleration and rotation tracking |
| Wheel Encoders | Measure wheel rotation for distance traveled |
| Magnetic / Compass | Earth’s magnetic field for heading |
| Touch Sensors | Contact detection (household robots) |
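Even before the hardware arrives, the math behind the simplest of these sensors is worth seeing. An ultrasonic rangefinder (e.g. the HC-SR04) just times an echo; a minimal sketch of the conversion, with the speed of sound as an assumed constant for ~20°C air:

```python
# Ultrasonic ranging: emit a ping, time the echo, halve the round trip.
# 343 m/s assumes dry air at ~20 C; real modules drift with temperature.

SPEED_OF_SOUND = 343.0  # m/s

def echo_to_distance(echo_time_s: float) -> float:
    """Convert a round-trip echo time (seconds) to distance (meters)."""
    return echo_time_s * SPEED_OF_SOUND / 2  # halve: sound travels out AND back

# A ~5.83 ms round trip corresponds to about 1 meter
print(f"{echo_to_distance(0.00583):.2f} m")
```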

Part 3: How Tesla and Waymo Actually See the Road

Tesla FSD — Vision Only

Tesla uses 8 monocular cameras arranged for 360° coverage:

                [Forward Narrow]
                 [Forward Main]
                 [Forward Wide]

[Pillar Left]                    [Pillar Right]
[Repeater Left]                  [Repeater Right]

                    [Rear]

| Camera | Resolution | Range | Purpose |
|---|---|---|---|
| Forward Main | 1.2 MP | 150m | Primary forward vision |
| Forward Narrow | 1.2 MP | 250m | Long range, highway reading |
| Forward Wide | 1.2 MP | 60m | Close range, intersections |
| Pillar (x2) | 1.2 MP | 100m | Traffic lights, crossing objects |
| Repeater (x2) | 1.2 MP | 100m | Lane changes, blind spots |
| Rear | 1.2 MP | 50m | Reversing, rear traffic |

All cameras run at 36 fps, processed by Tesla’s custom FSD Computer (144 TOPS). Tesla removed RADAR in 2021 (controversial), then reintroduced 4D imaging radar for some models.

Waymo — Full Multi-Modal Sensor Suite

| Sensor | Specs | Purpose |
|---|---|---|
| Roof LIDAR | Custom spinning, 360°, ~200m | Primary 3D scene understanding |
| Short-Range LIDAR | 4x solid-state (corners) | Close obstacle detection |
| Cameras | 29 total | 360° vision, color, texture |
| RADAR | 6x units | Speed measurement, all-weather reliability |

Head-to-Head Comparison

| Dimension | Tesla FSD | Waymo |
|---|---|---|
| Sensor approach | Camera only (8 cameras) | Camera + LIDAR + RADAR (40+ sensors) |
| Depth sensing | Inferred via neural net | Direct via LIDAR point clouds |
| Architecture | End-to-end neural network (FSD v12) | Modular (separate perception/planning) |
| HD maps | No — builds understanding on the fly | Yes — pre-mapped areas only |
| Rain/Fog | Camera degrades | LIDAR + RADAR more robust |
| Cost per vehicle | ~$1,000 sensor suite | ~$100,000+ sensor suite |
| Fleet | Millions of consumer vehicles | Smaller dedicated robotaxi fleet |
| Geography | Anywhere (no pre-mapping) | Only pre-mapped areas |

How Tesla Estimates Depth Without LIDAR

Tesla uses four methods simultaneously to figure out how far away the car in front is:

Method 1: Cross-Camera Geometry. Tesla’s 8 cameras are physically separated on the car. Same object seen from slightly different positions gives triangulation — synthetic stereo vision with a wide baseline.
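The triangulation in Method 1 is the classic stereo relation: depth = focal length × baseline / disparity. A toy sketch of the formula (the 700-pixel focal length and 0.5 m baseline below are invented numbers, not Tesla's actual camera geometry):

```python
def depth_from_disparity(focal_px: float, baseline_m: float, disparity_px: float) -> float:
    """Stereo depth: Z = f * B / d. Larger pixel disparity = closer object."""
    return focal_px * baseline_m / disparity_px

# Hypothetical rig: 700 px focal length, cameras 0.5 m apart
f, B = 700.0, 0.5
print(depth_from_disparity(f, B, 35.0))  # 10.0 m away
print(depth_from_disparity(f, B, 7.0))   # 50.0 m away
```

A wider baseline (cameras farther apart) makes distant objects easier to resolve, which is why spreading cameras across the whole car body helps.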

Method 2: Monocular Depth Cues. Even from one camera, the neural network learns cues humans also use:

| Visual Cue | What It Tells the Network |
|---|---|
| Object size | Small car = far away |
| Lane convergence | Vanishing point = distance scale |
| Occlusion | Car hidden behind another = closer |
| Texture density | Road markings getting smaller = distance |
| Shadow position | Height above ground |
| Position in frame | Lower in frame = closer |

Method 3: Motion Parallax. As Tesla moves, nearby objects shift fast across frames while distant objects barely move. Speed of pixel movement + known car speed = distance. This is Structure from Motion (SfM).
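For the simplified case of a camera translating sideways, motion parallax reduces to the same stereo triangle: the baseline is just the distance traveled between frames. A sketch under that assumption (all numbers are illustrative, and real SfM handles arbitrary motion, not just this special case):

```python
def depth_from_parallax(focal_px: float, speed_mps: float,
                        dt_s: float, pixel_shift_px: float) -> float:
    """Treat motion between two frames as a stereo baseline: Z = f * (v*dt) / shift."""
    baseline_m = speed_mps * dt_s  # distance traveled between the two frames
    return focal_px * baseline_m / pixel_shift_px

# At 20 m/s with frames 1/36 s apart, a 10 px shift implies ~39 m away
print(round(depth_from_parallax(700.0, 20.0, 1/36, 10.0), 1))  # 38.9
```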

Method 4: Occupancy Network. Tesla’s key innovation (FSD v10+). Eight camera feeds go into a neural network that outputs a 3D voxel grid of occupied vs. free space:

8 camera feeds (2D images)
        ↓
  Neural Network
        ↓
  3D Occupancy Grid

  ■ = occupied (car, person, wall)
  □ = free space (drivable road)

      □□□□□□□□□□□□□□
      □□□■■■□□■■■□□□   ← two cars in 3D grid
      □□□■■■□□■■■□□□
      □□□□□□□□□□□□□□
Tesla →  ←15m→  ←30m→
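A toy version of the occupancy-grid idea: bin 3D points into voxels and mark each voxel occupied or free. The points below are hand-made stand-ins for what the real network infers from camera pixels:

```python
import numpy as np

# Hand-made 3D points standing in for a perceived scene, in meters
# (x forward, y lateral, z up)
points = np.array([
    [15.0, 1.0, 0.5], [15.2, 1.1, 0.6],    # car-like cluster ~15 m ahead
    [30.0, -2.0, 0.8], [30.1, -1.9, 0.9],  # second cluster ~30 m ahead
])

VOXEL = 1.0  # 1 m voxels
grid = np.zeros((40, 10, 3), dtype=bool)  # 40 m forward, +/-5 m lateral, 3 m tall

# Quantize to voxel indices; shift lateral axis so y in [-5, 5) maps to [0, 10)
idx = np.floor(points / VOXEL).astype(int)
idx[:, 1] += 5
grid[idx[:, 0], idx[:, 1], idx[:, 2]] = True

print(grid.sum(), "occupied voxels")  # four points collapse into two voxels
print(np.argwhere(grid.any(axis=(1, 2))).ravel())  # occupied distances: [15 30]
```

The planner then only needs the boolean grid: anything marked free is drivable, regardless of what object class filled the occupied cells.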

Why this is controversial: LIDAR measures distance directly by timing light, so it is physics-based rather than inferred. Camera + neural net is a statistical estimate that can be wrong in rare cases. The 2016 fatal Autopilot crash involved a white truck trailer against a bright sky: poor contrast meant the vision system failed to register the trailer as an obstacle. LIDAR would have detected that truck regardless of color or lighting.

Tesla’s Fleet Learning Feedback Loop

Real car encounters edge case
        ↓
Data uploaded (driver intervention flagged)
        ↓
Added to training set
        ↓
Model retrained
        ↓
OTA update pushed to ~5 million Tesla vehicles
        ↓
All cars now handle that edge case better

No other autonomy company has this scale of real-world feedback.

Part 4: Test It Yourself — Depth Estimation With Your Camera

pip install opencv-python torch torchvision numpy timm ultralytics

Experiment 1: Monocular Depth (MiDaS)

Same concept Tesla uses — a neural network estimates depth from a single camera frame:

import cv2
import torch
import numpy as np

# Load MiDaS model (same type of approach Tesla uses)
model_type = "MiDaS_small"  # fast, runs on CPU
midas = torch.hub.load("intel-isl/MiDaS", model_type)
midas.eval()

transforms = torch.hub.load("intel-isl/MiDaS", "transforms")
transform = transforms.small_transform

cap = cv2.VideoCapture(0)  # your camera

while True:
    ret, frame = cap.read()
    if not ret:
        break

    img_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    input_tensor = transform(img_rgb)

    with torch.no_grad():
        depth = midas(input_tensor)
        depth = torch.nn.functional.interpolate(
            depth.unsqueeze(1),
            size=frame.shape[:2],
            mode="bicubic",
            align_corners=False,
        ).squeeze()

    depth_map = depth.numpy()
    depth_normalized = cv2.normalize(
        depth_map, None, 0, 255, cv2.NORM_MINMAX, cv2.CV_8U
    )
    depth_colored = cv2.applyColorMap(depth_normalized, cv2.COLORMAP_MAGMA)

    combined = np.hstack([frame, depth_colored])
    cv2.imshow("Camera | Depth Map (bright=close, dark=far)", combined)

    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

cap.release()
cv2.destroyAllWindows()

Wave your hand toward the camera and watch the depth map change in real time.

Experiment 2: Motion Parallax — Depth From Movement

This replicates Tesla’s sequential frame depth method using optical flow:

import cv2
import numpy as np

cap = cv2.VideoCapture(0)

feature_params = dict(maxCorners=100, qualityLevel=0.3,
                      minDistance=7, blockSize=7)
lk_params = dict(winSize=(15, 15), maxLevel=2,
                 criteria=(cv2.TERM_CRITERIA_EPS |
                           cv2.TERM_CRITERIA_COUNT, 10, 0.03))

ret, old_frame = cap.read()
old_gray = cv2.cvtColor(old_frame, cv2.COLOR_BGR2GRAY)
p0 = cv2.goodFeaturesToTrack(old_gray, mask=None, **feature_params)

while True:
    ret, frame = cap.read()
    if not ret:
        break
    frame_gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

    # Re-detect features if tracking has lost them all
    if p0 is None or len(p0) == 0:
        p0 = cv2.goodFeaturesToTrack(old_gray, mask=None, **feature_params)
        old_gray = frame_gray.copy()
        continue

    p1, st, err = cv2.calcOpticalFlowPyrLK(
        old_gray, frame_gray, p0, None, **lk_params
    )
    if p1 is None:  # tracking failed entirely this frame
        old_gray = frame_gray.copy()
        p0 = None
        continue

    good_new = p1[st == 1]
    good_old = p0[st == 1]

    for new, old in zip(good_new, good_old):
        a, b = new.ravel().astype(int)
        c, d = old.ravel().astype(int)

        motion = np.sqrt((a - c)**2 + (b - d)**2)

        if motion > 5:
            color = (0, 0, 255)     # red = close
            label = f"CLOSE ({motion:.1f}px)"
        elif motion > 2:
            color = (0, 165, 255)   # orange = medium
            label = f"MED ({motion:.1f}px)"
        else:
            color = (0, 255, 0)     # green = far
            label = f"FAR ({motion:.1f}px)"

        cv2.circle(frame, (a, b), 5, color, -1)
        cv2.putText(frame, label, (a+5, b),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.4, color, 1)

    cv2.imshow("Motion Parallax (RED=close, GREEN=far)", frame)

    old_gray = frame_gray.copy()
    p0 = good_new.reshape(-1, 1, 2)

    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

cap.release()
cv2.destroyAllWindows()

Move your hand toward the camera — tracked points turn red. Background stays green.

Experiment 3: Object Detection + Size-Based Distance

This replicates Tesla’s “known object size → distance” method:

import cv2
import numpy as np
from ultralytics import YOLO

model = YOLO("yolov8n.pt")  # downloads automatically

# Known real-world widths in meters
KNOWN_WIDTHS = {
    "person": 0.5, "car": 2.0, "truck": 2.5, "bicycle": 0.6,
    "bottle": 0.07, "cell phone": 0.07, "laptop": 0.35, "chair": 0.5,
}

FOCAL_LENGTH = 700  # pixels — calibrate for your camera

cap = cv2.VideoCapture(0)

while True:
    ret, frame = cap.read()
    if not ret:
        break

    results = model(frame, verbose=False)[0]

    for box in results.boxes:
        cls_name = model.names[int(box.cls)]
        x1, y1, x2, y2 = box.xyxy[0].int().tolist()
        confidence = float(box.conf)

        if cls_name in KNOWN_WIDTHS and confidence > 0.5:
            pixel_width = x2 - x1
            real_width = KNOWN_WIDTHS[cls_name]

            # Distance = (real_width * focal_length) / pixel_width
            distance = (real_width * FOCAL_LENGTH) / pixel_width

            label = f"{cls_name}: {distance:.1f}m"
            color = (0, 0, 255) if distance < 1.5 else (0, 255, 0)

            cv2.rectangle(frame, (x1, y1), (x2, y2), color, 2)
            cv2.putText(frame, label, (x1, y1 - 10),
                        cv2.FONT_HERSHEY_SIMPLEX, 0.6, color, 2)

    cv2.imshow("Object Distance Estimation", frame)

    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

cap.release()
cv2.destroyAllWindows()

Hold a bottle or phone at known distances (1m, 2m) and compare the estimate. To calibrate your focal length:

known_distance = 1.0     # meters
known_width_real = 0.07  # meters (e.g. a phone)
known_width_pixels = 55  # measure from your output frame

focal_length = (known_width_pixels * known_distance) / known_width_real
print(f"Your focal length: {focal_length:.1f} pixels")

Experiment 4: Combined System (Mini Tesla-Style)

Depth map + object detection running together:

import cv2
import torch
import numpy as np
from ultralytics import YOLO

midas = torch.hub.load("intel-isl/MiDaS", "MiDaS_small")
midas.eval()
transform = torch.hub.load("intel-isl/MiDaS", "transforms").small_transform
yolo = YOLO("yolov8n.pt")

cap = cv2.VideoCapture(0)

while True:
    ret, frame = cap.read()
    if not ret:
        break

    # Depth map
    img_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    with torch.no_grad():
        depth = midas(transform(img_rgb))
        depth = torch.nn.functional.interpolate(
            depth.unsqueeze(1), size=frame.shape[:2],
            mode="bicubic", align_corners=False).squeeze().numpy()

    depth_norm = cv2.normalize(depth, None, 0, 255,
                               cv2.NORM_MINMAX, cv2.CV_8U)
    depth_colored = cv2.applyColorMap(depth_norm, cv2.COLORMAP_MAGMA)

    # Object detection
    results = yolo(frame, verbose=False)[0]
    for box in results.boxes:
        x1, y1, x2, y2 = box.xyxy[0].int().tolist()
        cls_name = yolo.names[int(box.cls)]

        # Sample depth at center of bounding box
        cx, cy = (x1 + x2) // 2, (y1 + y2) // 2
        relative_depth = depth[cy, cx]

        label = f"{cls_name} d={relative_depth:.0f}"
        cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 255, 0), 2)
        cv2.putText(frame, label, (x1, y1 - 8),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 2)

    combined = np.hstack([frame, depth_colored])
    cv2.imshow("Detection + Depth (Tesla-style)", combined)

    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

cap.release()
cv2.destroyAllWindows()

| Experiment | What It Teaches | Tesla Equivalent |
|---|---|---|
| 1 — MiDaS | Neural nets estimate depth from one image | Occupancy network concept |
| 2 — Optical Flow | Motion between frames reveals depth | Sequential frame analysis |
| 3 — YOLO + Size | Known object sizes give absolute distance | Object-size prior |
| 4 — Combined | Fuse estimates into one system | How Tesla fuses all methods |

Part 5: PID Control — The 80-Year-Old Algorithm Running 95% of Robots

PID stands for Proportional, Integral, Derivative — three terms that work together to minimize the error between where you are and where you want to be.

The Core Idea

You want to be at position 100
You are currently at position 70
Error = 100 - 70 = 30

P → How big is the error RIGHT NOW?
I → Has there been a persistent error OVER TIME?
D → Is the error GROWING or SHRINKING?

Real-World Analogy: Maintaining Speed

Current speed: 45 mph    Target: 60 mph    Error: 15 mph

P (Proportional):  Big error → press gas hard
I (Integral):      Been slow for a while → press even more
D (Derivative):    Speed increasing fast → ease off slightly
                   (prevent overshooting to 75 mph)

Each Term Explained

P — Proportional (react to current error):

Output = Kp × error

Error=30, Kp=0.5  →  Output=15 (strong correction)
Error=10, Kp=0.5  →  Output=5  (medium correction)
Error=2,  Kp=0.5  →  Output=1  (small correction)

Problem: if the system needs sustained output just to hold position (gravity, friction, a constant disturbance), P alone settles with a small remaining error (the steady-state error), because zero error would mean zero output. It needs help.
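You can watch the steady-state error appear in a few lines. A P-only controller fighting a constant disturbance settles exactly where Kp × error cancels the disturbance, which forces a nonzero error (this toy first-order system and its numbers are invented for illustration):

```python
# P-only control vs. a constant disturbance: the system settles where
# Kp * error exactly cancels the disturbance, i.e. at a nonzero error.
setpoint, kp, disturbance = 100.0, 0.5, 5.0
value = 0.0
for _ in range(200):
    error = setpoint - value
    output = kp * error - disturbance  # disturbance constantly pushes back
    value += output * 0.1              # simple first-order response

# Settles near 90, not 100: steady-state error = disturbance / Kp = 10
print(round(value, 1))  # 90.0
```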

I — Integral (react to accumulated error over time):

Output = Ki × Σ(error × Δt)   (the error accumulated over time)

Time:    0   1   2   3   4   5  seconds
Error:  10  10  10  10  10  10  (stuck 10° below target)

Integral keeps growing:
  t=1: 10
  t=2: 20
  t=3: 30  ← output keeps increasing until error is gone

Without I, a heater stuck just below its target sits at 99°C forever. With I, the integral grows, pushes harder, and eventually reaches 100°C exactly. But if Ki is too large, the integral winds up, and the system overshoots and oscillates wildly.
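The standard fix for a runaway integral is anti-windup: clamp the accumulated term so its contribution can never exceed what the actuator can deliver. A minimal sketch of the clamp (the ±50 output limits are arbitrary):

```python
def update_integral(integral: float, error: float, dt: float, ki: float,
                    out_min: float = -50.0, out_max: float = 50.0) -> float:
    """Accumulate error, but clamp so ki * integral stays within actuator limits."""
    integral += error * dt
    # Anti-windup: keep the integral's contribution inside [out_min, out_max]
    return max(out_min / ki, min(out_max / ki, integral))

# A huge persistent error saturates at the clamp instead of growing forever
i = 0.0
for _ in range(1000):
    i = update_integral(i, error=100.0, dt=0.05, ki=0.5)
print(i * 0.5)  # 50.0 — capped at the actuator limit
```

Without the clamp, 1000 steps of this error would leave the integral demanding an output of 2500: the system would then blast past the setpoint for a long time while the integral unwinds.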

D — Derivative (react to rate of change):

Output = Kd × (change in error / change in time)

Error shrinking fast:  derivative negative → reduce output (don’t overshoot)
Error growing fast:    derivative positive → increase output urgently
Error stable:          derivative zero     → no D contribution

D is anticipation — it predicts where the system is heading and applies braking force for a smooth arrival.

The Full PID Formula

Output = Kp×error + Ki×∫error dt + Kd×(d error/dt)

Visual Intuition

Well-tuned PID:
Target ────────────────────  100
                 _____/
            ____/
  _________/
Actual

Too little D (oscillates):
Target ────────────────────  100
         /\        /\
        /  \      /  \      /\
─────/    \    /    \    /  \______
             \  /      \  /
              \/        \/

Kp too small (never reaches):
Target ────────────────────  100
                         ________  95
          ______________/
─────────

Full Simulation Code

import time
import matplotlib.pyplot as plt

class PID:
    def __init__(self, kp, ki, kd, setpoint):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.setpoint = setpoint
        self.prev_error = 0
        self.integral = 0
        self.last_time = time.time()

    def compute(self, actual_value):
        current_time = time.time()
        dt = current_time - self.last_time
        if dt <= 0:
            dt = 0.001

        error = self.setpoint - actual_value

        P = self.kp * error
        self.integral += error * dt
        I = self.ki * self.integral
        derivative = (error - self.prev_error) / dt
        D = self.kd * derivative

        output = P + I + D
        self.prev_error = error
        self.last_time = current_time

        return output, P, I, D, error


class SimulatedSystem:
    def __init__(self, initial_value=0):
        self.value = initial_value
        self.velocity = 0

    def update(self, control_output, dt=0.05):
        self.velocity += control_output * dt * 0.5
        self.velocity *= 0.95  # friction/damping
        self.value += self.velocity * dt
        return self.value


# Run simulation
setpoint = 100.0
pid = PID(kp=2.0, ki=0.5, kd=1.0, setpoint=setpoint)
system = SimulatedSystem(initial_value=0)

times, actuals, targets = [], [], []
p_terms, i_terms, d_terms = [], [], []

t = 0
for step in range(100):
    actual = system.value
    output, P, I, D, error = pid.compute(actual)
    output = max(-50, min(50, output))  # clamp like real actuators
    system.update(output)

    times.append(t)
    actuals.append(actual)
    targets.append(setpoint)
    p_terms.append(P)
    i_terms.append(I)
    d_terms.append(D)

    t += 0.05
    time.sleep(0.01)

# Plot results
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(10, 8))

ax1.plot(times, actuals, 'b-', label='Actual', linewidth=2)
ax1.plot(times, targets, 'r--', label='Target', linewidth=2)
ax1.set_ylabel('Value')
ax1.set_title('PID Control — System Response')
ax1.legend()
ax1.grid(True)

ax2.plot(times, p_terms, 'r-', label='P term', linewidth=1.5)
ax2.plot(times, i_terms, 'g-', label='I term', linewidth=1.5)
ax2.plot(times, d_terms, 'b-', label='D term', linewidth=1.5)
ax2.set_xlabel('Time (seconds)')
ax2.set_ylabel('Contribution')
ax2.set_title('PID — Individual Term Contributions')
ax2.legend()
ax2.grid(True)

plt.tight_layout()
plt.savefig('pid_response.png', dpi=150)
plt.show()

Try changing the gains and re-running:

| Configuration | Kp | Ki | Kd | What Happens |
|---|---|---|---|---|
| Well tuned | 2.0 | 0.5 | 1.0 | Smooth response |
| Aggressive P | 8.0 | 0.5 | 0.1 | Oscillates |
| No D | 2.0 | 0.5 | 0.0 | Overshoots |
| Too much I | 2.0 | 5.0 | 1.0 | Slow windup, overshoots |
| P only | 1.0 | 0.0 | 0.0 | Steady state error |
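If guessing gains feels unprincipled, the classic Ziegler–Nichols recipe gives starting values from two measurements: the ultimate gain Ku (the Kp at which a P-only loop oscillates steadily) and the oscillation period Tu. A sketch of the classic rule (the Ku and Tu values below are hypothetical):

```python
def ziegler_nichols_pid(ku: float, tu: float):
    """Classic Ziegler-Nichols rule: Kp = 0.6*Ku, Ti = Tu/2, Td = Tu/8."""
    kp = 0.6 * ku
    ki = kp / (tu / 2)  # integral gain from integral time Ti
    kd = kp * (tu / 8)  # derivative gain from derivative time Td
    return kp, ki, kd

# Hypothetical measurement: loop oscillates at Kp = 8.0 with a 2-second period
print(tuple(round(g, 2) for g in ziegler_nichols_pid(ku=8.0, tu=2.0)))  # (4.8, 4.8, 1.2)
```

Treat the result as a starting point: Ziegler–Nichols tends toward aggressive, overshooting responses, so you usually back off Kp and Ki from there.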

PID + Camera: Object Tracking (Tesla Lane Centering)

import cv2
from ultralytics import YOLO

class PID:
    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.prev_error = 0
        self.integral = 0

    def compute(self, error, dt=0.033):
        self.integral += error * dt
        derivative = (error - self.prev_error) / dt
        output = self.kp * error + self.ki * self.integral + self.kd * derivative
        self.prev_error = error
        return output

model = YOLO("yolov8n.pt")
pid = PID(kp=0.5, ki=0.01, kd=0.1)

cap = cv2.VideoCapture(0)
frame_width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
center_x = frame_width // 2

while True:
    ret, frame = cap.read()
    if not ret:
        break

    results = model(frame, verbose=False)[0]

    for box in results.boxes:
        if model.names[int(box.cls)] == "person":
            x1, y1, x2, y2 = box.xyxy[0].int().tolist()
            obj_center_x = (x1 + x2) // 2

            error = obj_center_x - center_x
            correction = pid.compute(error)

            cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 255, 0), 2)
            cv2.line(frame, (center_x, 0),
                     (center_x, frame.shape[0]), (255, 0, 0), 2)
            cv2.line(frame, (obj_center_x, 0),
                     (obj_center_x, frame.shape[0]), (0, 0, 255), 1)

            # Positive correction = object right of center: steer right to re-center it
            direction = "RIGHT" if correction > 0 else "LEFT"
            cv2.putText(frame, f"Error: {error}px", (10, 30),
                        cv2.FONT_HERSHEY_SIMPLEX, 0.7, (0, 255, 255), 2)
            cv2.putText(frame, f"PID: {correction:.1f}", (10, 60),
                        cv2.FONT_HERSHEY_SIMPLEX, 0.7, (0, 255, 255), 2)
            cv2.putText(frame, f"Steer: {direction}", (10, 90),
                        cv2.FONT_HERSHEY_SIMPLEX, 0.7, (0, 0, 255), 2)

    cv2.imshow("PID Object Tracking (Tesla lane centering)", frame)

    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

cap.release()
cv2.destroyAllWindows()

This simulates the same logic Tesla uses for lane centering — measure error from desired position, apply PID correction, continuously minimize error.

Where Each Term Dominates in Tesla

| Situation | Dominant Term |
|---|---|
| Car drifting from lane center | P (big error, correct now) |
| Persistent crosswind | I (keeps pushing off → I builds up) |
| Approaching sharp curve | D (error changing fast → anticipate) |
| Sudden obstacle | P + D (large error + fast change) |

Tesla’s Full Control Stack

Neural Network (FSD)         ← high-level decisions
        ↓
MPC (Model Predictive Control) ← trajectory tracking
        ↓
PID Controllers               ← actuator commands
        ↓
Steering Motor / Throttle / Brakes

In FSD v12, the neural network increasingly replaces explicit MPC and PID with learned control — the planning and control steps are implicit inside the network weights. But PID still handles the lowest-level actuator precision.

Part 6: Why Robots Need Foundation Models

Traditional Robot:           Foundation Model Robot:
  "pick red ball"       vs     "grab the thing that
  (hardcoded)                   rolls and bounces"
  breaks if ball                works on any new object
  changes color                 it has never seen before

1. Robot Policy Learning

The old way: engineers write thousands of IF/THEN rules for every scenario. The problem: infinite edge cases (wet objects slip, fragile objects break, irregular shapes have no rule). Foundation models watch human demonstrations and learn generalizable policies that work on objects never seen in training.

2. Value Learning

Instead of fixed reward functions that get hacked (robot flips the box upside down to “complete” the task), value learning assigns cost to states. Low cost: object grasped securely, smooth motion. High cost: object falling, jerky motion, near collision. The robot chooses the lowest-cost path through state space.

3. High-Level Task Planning

Traditional: pre-program every sub-step of “make coffee” — fails if the mug is in a different location or someone says “brew me a cup” instead. LLMs understand natural language, generate plans automatically, and each step gets sent to lower-level controllers.

4. LLM Code Generation

New task needed? The old way: hire an engineer, weeks of coding. With LLM code generation (e.g. GenSim), describe the task in English and get simulation environment, reward function, training scenarios, and test cases — minutes instead of months.

5. Open Vocabulary Object Detection

Traditional detectors only recognize categories seen during training. A hospital robot sees an IV drip stand and outputs UNKNOWN. Open vocabulary models (like Grounding DINO) understand any text description: “the thing that holds the fluid bag on a pole” → detects the IV drip stand correctly, despite never having seen one during training.

6. Semantic Segmentation

Object detection draws a box. But a robot needs exact pixel boundaries to know where to grip:

Bounding box:   ████████   ← robot grips the air next to the cup
                ████████

Segmentation:     ░░░░░░
                ░░██████░░
                ░░██████──  ← handle identified separately
                ░░██████░░
                  ░░░░░░

Robot knows: grip the body (not the handle, too fragile), handle is an alternative grip point, bottom is a stable placement point.

7. 3D Representation and Affordance

Affordance = what can you DO with an object. Traditional robots see a cup and know it IS a cup, but don’t know which end to drink from, where the handle is grippable, or whether it will tip. Foundation models infer affordances from shape geometry, material properties, and physical common sense learned from internet data.

How It All Connects

Human: "put the fragile glass on the high shelf carefully"
        ↓
  LLM Task Planning
  "fragile" → gentle grip    "high shelf" → extend arm    "carefully" → slow
        ↓
  Open Vocab Detection
  Finds "glass" in scene — never seen this exact glass before
        ↓
  Semantic Segmentation
  Exact shape: rim, body, base locations identified
        ↓
  3D Affordance
  Best grip point, force needed, keep upright
        ↓
  Value Learning
  Lowest cost path = smooth, slow arc upward
        ↓
  Diffusion Policy
  Smooth grasping motion learned from human demos
        ↓
  PID / MPC
  Precise actuator control, maintains grip force
        ↓
  Glass placed safely on shelf

Summary

| Problem | Solution |
|---|---|
| Can’t write rules for every scenario | Policy learning from demonstrations |
| Fixed reward functions get hacked | Value learning |
| Can’t pre-program every task | LLM task planning + code generation |
| Can’t recognize every object | Open vocabulary detection |
| Bounding boxes aren’t precise enough | Semantic segmentation |
| Don’t know how to interact with objects | 3D affordance models |

Part 7: RAG for Robots — How Semantic Memory Enables Navigation

Traditional RAG:           Robotics RAG:
  Documents → vectors       Camera frames → vectors
  "find relevant text"       "find relevant memory"
  answer questions           navigate and act in world

Everything a robot sees becomes a “document” stored in vector space.

How Robot Memory Gets Built

t=0s:   Camera captures kitchen
        VLM captions: "kitchen counter, chips bag on left shelf,
                       microwave on right, blue mug near sink"
        → embed caption → store vector + timestamp + location

t=30s:  Camera captures window area
        VLM captions: "large window, green plants, city view"
        → embed → store

t=60s:  Camera captures conference room
        VLM captions: "meeting table, whiteboard, projector"
        → embed → store

... robot builds semantic map of entire environment

How RAG Retrieval Works for Navigation

Human says: "get me the chips"
        ↓
  Speech → Text (Whisper)
        ↓
  Embed the query: "chips" → [0.23, 0.87, 0.12, ...]
        ↓
  Vector similarity search:
    Score 0.94: "kitchen counter, chips bag on left shelf"  ← MATCH
    Score 0.71: "break room, snack cabinet, various foods"
    Score 0.43: "office desk, papers, coffee cup"
        ↓
  Top result: kitchen, left shelf
        ↓
  Robot navigates and retrieves chips

The key: “chips” was never stored as an exact keyword. The vector embedding captures semantic meaning, so the query also finds “snack cabinet” and “Doritos on counter” — same concept, different words.
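That “same concept, different words” matching comes down to cosine similarity between embedding vectors. A toy sketch with invented 3-dimensional vectors (real sentence embeddings have hundreds of dimensions; these numbers exist only to show the geometry):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Angle-based similarity: 1.0 = same direction, near 0.0 = unrelated."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Invented toy "embeddings": food-related vectors point in a similar direction
chips  = np.array([0.9, 0.1, 0.0])
snacks = np.array([0.8, 0.3, 0.1])
window = np.array([0.0, 0.2, 0.9])

print(round(cosine_similarity(chips, snacks), 2))  # high: same concept
print(round(cosine_similarity(chips, window), 2))  # low: unrelated
```

A real embedding model learns to place “chips”, “snacks”, and “Doritos” in nearby directions from reading internet-scale text, which is exactly what makes the keyword-free retrieval above work.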

Full Architecture

Camera (30 fps) → VLM captions every 3 s
Microphone → Whisper → text
GPS/Odometry → location coordinates
        ↓
Embed → Vector Database (FAISS/Milvus)
[vec1, vec2, vec3 ... millions of memories]
        ↓
Query → embed → similarity search
Top K memories → inject into LLM context
LLM reasons: "Go to kitchen, chips on left"
        ↓
Navigation → path to kitchen
Manipulation → grasp chips bag
Deliver to human

Code: Build a Robot Memory System

import numpy as np
import faiss
from sentence_transformers import SentenceTransformer
from datetime import datetime

# pip install faiss-cpu sentence-transformers

class RobotMemory:
    def __init__(self):
        self.model = SentenceTransformer('all-MiniLM-L6-v2')
        self.dimension = 384
        self.index = faiss.IndexFlatL2(self.dimension)
        self.metadata = []

    def store_observation(self, caption, location=(0, 0)):
        """Store a robot observation as a memory"""
        vector = self.model.encode([caption])[0]
        self.index.add(np.array([vector], dtype=np.float32))
        self.metadata.append({
            'caption': caption,
            'location': location,
            'timestamp': datetime.now().isoformat(),
            'id': len(self.metadata)
        })
        print(f"Stored memory #{len(self.metadata)-1}: "
              f"'{caption}' at {location}")

    def query(self, question, top_k=3):
        """Find most relevant memories for a question"""
        query_vector = self.model.encode([question])[0]
        distances, indices = self.index.search(
            np.array([query_vector], dtype=np.float32), top_k
        )
        results = []
        for dist, idx in zip(distances[0], indices[0]):
            if idx < len(self.metadata):
                memory = self.metadata[idx].copy()
                memory['similarity_score'] = float(1 / (1 + dist))
                results.append(memory)
        return results

    def decide_action(self, query):
        """RAG: retrieve memories + reason about action"""
        memories = self.query(query, top_k=3)

        print(f"\nQuery: '{query}'")
        print("Relevant memories found:")
        for m in memories:
            print(f"  [{m['similarity_score']:.2f}] "
                  f"'{m['caption']}' at {m['location']}")

        best = memories[0]
        print(f"\nAction: Navigate to {best['location']}")
        print(f"Reason: '{best['caption']}' found there")
        return best['location']


# Simulate robot exploring environment
robot = RobotMemory()

robot.store_observation(
    "kitchen counter with chips bag on left shelf", location=(10, 5))
robot.store_observation(
    "large window with green plants and city view", location=(25, 3))
robot.store_observation(
    "conference room with whiteboard and projector", location=(15, 20))
robot.store_observation(
    "break room with coffee machine and mugs", location=(8, 15))
robot.store_observation(
    "storage cabinet with snacks and beverages", location=(12, 6))
robot.store_observation(
    "office desk with computer monitor and papers", location=(20, 10))

print("\n" + "=" * 50)
print("ROBOT MEMORY SYSTEM READY")
print("=" * 50)

robot.decide_action("get me something to eat")
robot.decide_action("I want to see something beautiful outside")
robot.decide_action("where can I make coffee")
robot.decide_action("find me a place to have a meeting")

Output (exact similarity scores will vary with the embedding model):

Query: 'get me something to eat'
Relevant memories found:
  [0.82] 'kitchen counter with chips bag on left shelf' at (10, 5)
  [0.79] 'storage cabinet with snacks and beverages' at (12, 6)
  [0.61] 'break room with coffee machine and mugs' at (8, 15)

Action: Navigate to (10, 5)
Reason: 'kitchen counter with chips bag on left shelf' found there

Why RAG Beats Traditional Navigation

| Traditional Robot Navigation | RAG-Based Robot Memory |
|---|---|
| Needs explicit map: “chips at (10,5)” | Builds map automatically from observations |
| Breaks if object moves | Updates when robot re-explores |
| Can’t understand “something to eat” | Understands any natural language query |
| New objects require reprogramming | Generalizes to new objects and queries |
| No memory across sessions | Memory persists indefinitely |

This is the kind of capability NVIDIA has demonstrated — a robot that remembers what it has seen and navigates based on semantic memory, not hardcoded maps.

