Akshay Parkhi's Weblog

The Complete Beginner’s Guide to Robotics — From Your First Camera to Foundation Models

10th March 2026

This is everything you need to go from zero to building, sensing, and coding robots at home. We’ll start with what hardware to buy, walk through how Tesla and Waymo actually see the road, write real depth estimation and PID control code you can run with just a camera, and end with how foundation models and RAG are transforming what robots can do.

Part 1: Building Your Home Robotics Lab

If you have a camera, you already have the most useful sensor in robotics. Here’s what else you need.

Controllers

| Controller | Price | Best For | Key Feature |
|---|---|---|---|
| Raspberry Pi 4/5 | $35–80 | All-around beginner | Runs Linux, ROS 2, GPIO pins |
| NVIDIA Jetson Nano/Orin Nano | $100–200 | AI and vision workloads | GPU for neural networks on camera input |
| Arduino Uno/Mega | $25–45 | Actuator control | No OS, pairs well with Raspberry Pi |

Actuators

| Type | Price | Use Case |
|---|---|---|
| SG90 Micro Servo | $2–5 | Small robot arms, precise 0°–180° angles |
| MG996R Servo | $8–12 | Larger joints, more torque |
| DC Motor + L298N Driver | ~$5 | Wheels and drivetrain |
| NEMA 17 Stepper | $10–15 | Precise position control (3D printer style) |

Starter Kits

| Kit | Price | What You Get |
|---|---|---|
| Elegoo Smart Robot Car | ~$50 | Wheeled robot with sensors, motors, controller |
| Freenove Big Hexapod Robot | ~$80 | 6-legged walking robot for Raspberry Pi |

What You Can Do Right Now With Just a Camera

Camera  →  Raspberry Pi / your laptop  →  ROS 2

# Install ROS 2 and test your camera immediately
sudo apt install ros-humble-usb-cam
ros2 run usb_cam usb_cam_node_exe

# View the stream
ros2 run rqt_image_view rqt_image_view

Even before hardware arrives, you can run object detection (YOLOv8), pose estimation, AprilTag tracking, and depth estimation — all from a laptop camera.

Recommended Learning Path

  1. Camera + laptop — Learn ROS 2, computer vision, object detection
  2. Add Raspberry Pi (~$50 total) — Learn GPIO, Linux, hardware communication
  3. Add servos + motors (~$30 more) — Build a robot that reacts to what it sees

Part 2: Sensors — How Robots See the World

Camera Types

| Type | How It Works | Best For |
|---|---|---|
| Monocular | Single lens, 2D images | Basic detection, image recognition, simple navigation |
| Stereo | Two lenses, mimics human binocular vision | Depth perception, 3D mapping, obstacle avoidance |
| RGB-D | RGB color + depth sensing combined | Advanced navigation, object manipulation (e.g. Intel RealSense) |
| Thermal | Detects infrared radiation / temperature | Dark environments, smoke, fog, search and rescue |
| Event-Based | Per-pixel brightness-change events; high temporal resolution, low latency | High-speed motion detection and tracking |

LIDAR Types

| Type | Characteristics | Use Case |
|---|---|---|
| 2D LIDAR | Simpler, cheaper, faster | Basic planar navigation, obstacle detection |
| 3D LIDAR | Detailed spatial info, wide field of view | Autonomous vehicle mapping |
| Solid-State / Flash / MEMS | Durable, cost-effective | Production vehicles, consumer robotics |

Other Common Sensors

| Sensor | What It Does |
|---|---|
| Ultrasonic | Sound waves measure distance (used in vacuum robots) |
| RADAR | Distance, speed, and size — reliable in all conditions |
| GPS | Outdoor localization |
| IMU | Acceleration and rotation tracking |
| Wheel Encoders | Measure wheel rotation for distance traveled |
| Magnetic / Compass | Earth’s magnetic field for heading |
| Touch Sensors | Contact detection (household robots) |
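Even before the hardware arrives, the math behind the simplest of these sensors is worth seeing. An ultrasonic rangefinder (e.g. the HC-SR04) just times an echo; a minimal sketch of the conversion, with the speed of sound as an assumed constant for ~20°C air:

```python
# Ultrasonic ranging: emit a ping, time the echo, halve the round trip.
# 343 m/s assumes dry air at ~20 C; real modules drift with temperature.

SPEED_OF_SOUND = 343.0  # m/s

def echo_to_distance(echo_time_s: float) -> float:
    """Convert a round-trip echo time (seconds) to distance (meters)."""
    return echo_time_s * SPEED_OF_SOUND / 2  # halve: sound travels out AND back

# A ~5.83 ms round trip corresponds to about 1 meter
print(f"{echo_to_distance(0.00583):.2f} m")
```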

Part 3: How Tesla and Waymo Actually See the Road

Tesla FSD — Vision Only

Tesla uses 8 monocular cameras arranged for 360° coverage:

                [Forward Narrow]
                 [Forward Main]
                 [Forward Wide]

[Pillar Left]                    [Pillar Right]
[Repeater Left]                  [Repeater Right]

                    [Rear]

| Camera | Resolution | Range | Purpose |
|---|---|---|---|
| Forward Main | 1.2 MP | 150m | Primary forward vision |
| Forward Narrow | 1.2 MP | 250m | Long range, highway reading |
| Forward Wide | 1.2 MP | 60m | Close range, intersections |
| Pillar (x2) | 1.2 MP | 100m | Traffic lights, crossing objects |
| Repeater (x2) | 1.2 MP | 100m | Lane changes, blind spots |
| Rear | 1.2 MP | 50m | Reversing, rear traffic |

All cameras run at 36 fps, processed by Tesla’s custom FSD Computer (144 TOPS). Tesla removed RADAR in 2021 (controversial), then reintroduced 4D imaging radar for some models.

Waymo — Full Multi-Modal Sensor Suite

| Sensor | Specs | Purpose |
|---|---|---|
| Roof LIDAR | Custom spinning, 360°, ~200m | Primary 3D scene understanding |
| Short-Range LIDAR | 4x solid-state (corners) | Close obstacle detection |
| Cameras | 29 total | 360° vision, color, texture |
| RADAR | 6x units | Speed measurement, all-weather reliability |

Head-to-Head Comparison

| Dimension | Tesla FSD | Waymo |
|---|---|---|
| Sensor approach | Camera only (8 cameras) | Camera + LIDAR + RADAR (40+ sensors) |
| Depth sensing | Inferred via neural net | Direct via LIDAR point clouds |
| Architecture | End-to-end neural network (FSD v12) | Modular (separate perception/planning) |
| HD maps | No — builds understanding on the fly | Yes — pre-mapped areas only |
| Rain/Fog | Camera degrades | LIDAR + RADAR more robust |
| Cost per vehicle | ~$1,000 sensor suite | ~$100,000+ sensor suite |
| Fleet | Millions of consumer vehicles | Smaller dedicated robotaxi fleet |
| Geography | Anywhere (no pre-mapping) | Only pre-mapped areas |

How Tesla Estimates Depth Without LIDAR

Tesla uses four methods simultaneously to figure out how far away the car in front is:

Method 1: Cross-Camera Geometry. Tesla’s 8 cameras are physically separated on the car. Same object seen from slightly different positions gives triangulation — synthetic stereo vision with a wide baseline.
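The triangulation in Method 1 is the classic stereo relation: depth = focal length × baseline / disparity. A toy sketch of the formula (the 700-pixel focal length and 0.5 m baseline below are invented numbers, not Tesla's actual camera geometry):

```python
def depth_from_disparity(focal_px: float, baseline_m: float, disparity_px: float) -> float:
    """Stereo depth: Z = f * B / d. Larger pixel disparity = closer object."""
    return focal_px * baseline_m / disparity_px

# Hypothetical rig: 700 px focal length, cameras 0.5 m apart
f, B = 700.0, 0.5
print(depth_from_disparity(f, B, 35.0))  # 10.0 m away
print(depth_from_disparity(f, B, 7.0))   # 50.0 m away
```

A wider baseline (cameras farther apart) makes distant objects easier to resolve, which is why spreading cameras across the whole car body helps.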

Method 2: Monocular Depth Cues. Even from one camera, the neural network learns cues humans also use:

| Visual Cue | What It Tells the Network |
|---|---|
| Object size | Small car = far away |
| Lane convergence | Vanishing point = distance scale |
| Occlusion | Car hidden behind another = closer |
| Texture density | Road markings getting smaller = distance |
| Shadow position | Height above ground |
| Position in frame | Lower in frame = closer |

Method 3: Motion Parallax. As Tesla moves, nearby objects shift fast across frames while distant objects barely move. Speed of pixel movement + known car speed = distance. This is Structure from Motion (SfM).
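For the simplified case of a camera translating sideways, motion parallax reduces to the same stereo triangle: the baseline is just the distance traveled between frames. A sketch under that assumption (all numbers are illustrative, and real SfM handles arbitrary motion, not just this special case):

```python
def depth_from_parallax(focal_px: float, speed_mps: float,
                        dt_s: float, pixel_shift_px: float) -> float:
    """Treat motion between two frames as a stereo baseline: Z = f * (v*dt) / shift."""
    baseline_m = speed_mps * dt_s  # distance traveled between the two frames
    return focal_px * baseline_m / pixel_shift_px

# At 20 m/s with frames 1/36 s apart, a 10 px shift implies ~39 m away
print(round(depth_from_parallax(700.0, 20.0, 1/36, 10.0), 1))  # 38.9
```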

Method 4: Occupancy Network. Tesla’s key innovation (FSD v10+). Eight camera feeds go into a neural network that outputs a 3D voxel grid of occupied vs. free space:

8 camera feeds (2D images)
        ↓
  Neural Network
        ↓
  3D Occupancy Grid

  ■ = occupied (car, person, wall)
  □ = free space (drivable road)

      □□□□□□□□□□□□□□
      □□□■■■□□■■■□□□   ← two cars in 3D grid
      □□□■■■□□■■■□□□
      □□□□□□□□□□□□□□
Tesla →  ←15m→  ←30m→
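A toy version of the occupancy-grid idea: bin 3D points into voxels and mark each voxel occupied or free. The points below are hand-made stand-ins for what the real network infers from camera pixels:

```python
import numpy as np

# Hand-made 3D points standing in for a perceived scene, in meters
# (x forward, y lateral, z up)
points = np.array([
    [15.0, 1.0, 0.5], [15.2, 1.1, 0.6],    # car-like cluster ~15 m ahead
    [30.0, -2.0, 0.8], [30.1, -1.9, 0.9],  # second cluster ~30 m ahead
])

VOXEL = 1.0  # 1 m voxels
grid = np.zeros((40, 10, 3), dtype=bool)  # 40 m forward, +/-5 m lateral, 3 m tall

# Quantize to voxel indices; shift lateral axis so y in [-5, 5) maps to [0, 10)
idx = np.floor(points / VOXEL).astype(int)
idx[:, 1] += 5
grid[idx[:, 0], idx[:, 1], idx[:, 2]] = True

print(grid.sum(), "occupied voxels")  # four points collapse into two voxels
print(np.argwhere(grid.any(axis=(1, 2))).ravel())  # occupied distances: [15 30]
```

The planner then only needs the boolean grid: anything marked free is drivable, regardless of what object class filled the occupied cells.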

Why this is controversial: LIDAR measures distance directly by timing light, so it is physics-based rather than inferred. Camera + neural net is a statistical estimate that can be wrong in rare cases. The 2016 fatal Autopilot crash involved a white truck trailer against a bright sky: poor contrast meant the vision system failed to register the trailer as an obstacle. LIDAR would have detected that truck regardless of color or lighting.

Tesla’s Fleet Learning Feedback Loop

Real car encounters edge case
        ↓
Data uploaded (driver intervention flagged)
        ↓
Added to training set
        ↓
Model retrained
        ↓
OTA update pushed to ~5 million Tesla vehicles
        ↓
All cars now handle that edge case better

No other autonomy company has this scale of real-world feedback.

Part 4: Test It Yourself — Depth Estimation With Your Camera

pip install opencv-python torch torchvision numpy timm ultralytics

Experiment 1: Monocular Depth (MiDaS)

Same concept Tesla uses — a neural network estimates depth from a single camera frame:

import cv2
import torch
import numpy as np

# Load MiDaS model (same type of approach Tesla uses)
model_type = "MiDaS_small"  # fast, runs on CPU
midas = torch.hub.load("intel-isl/MiDaS", model_type)
midas.eval()

transforms = torch.hub.load("intel-isl/MiDaS", "transforms")
transform = transforms.small_transform

cap = cv2.VideoCapture(0)  # your camera

while True:
    ret, frame = cap.read()
    if not ret:
        break

    img_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    input_tensor = transform(img_rgb)

    with torch.no_grad():
        depth = midas(input_tensor)
        depth = torch.nn.functional.interpolate(
            depth.unsqueeze(1),
            size=frame.shape[:2],
            mode="bicubic",
            align_corners=False,
        ).squeeze()

    depth_map = depth.numpy()
    depth_normalized = cv2.normalize(
        depth_map, None, 0, 255, cv2.NORM_MINMAX, cv2.CV_8U
    )
    depth_colored = cv2.applyColorMap(depth_normalized, cv2.COLORMAP_MAGMA)

    combined = np.hstack([frame, depth_colored])
    cv2.imshow("Camera | Depth Map (bright=close, dark=far)", combined)

    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

cap.release()
cv2.destroyAllWindows()

Wave your hand toward the camera and watch the depth map change in real time.

Experiment 2: Motion Parallax — Depth From Movement

This replicates Tesla’s sequential frame depth method using optical flow:

import cv2
import numpy as np

cap = cv2.VideoCapture(0)

feature_params = dict(maxCorners=100, qualityLevel=0.3,
                      minDistance=7, blockSize=7)
lk_params = dict(winSize=(15, 15), maxLevel=2,
                 criteria=(cv2.TERM_CRITERIA_EPS |
                           cv2.TERM_CRITERIA_COUNT, 10, 0.03))

ret, old_frame = cap.read()
old_gray = cv2.cvtColor(old_frame, cv2.COLOR_BGR2GRAY)
p0 = cv2.goodFeaturesToTrack(old_gray, mask=None, **feature_params)

while True:
    ret, frame = cap.read()
    if not ret:
        break
    frame_gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

    # Re-detect features if tracking has lost them all
    if p0 is None or len(p0) == 0:
        p0 = cv2.goodFeaturesToTrack(old_gray, mask=None, **feature_params)
        old_gray = frame_gray.copy()
        continue

    p1, st, err = cv2.calcOpticalFlowPyrLK(
        old_gray, frame_gray, p0, None, **lk_params
    )
    if p1 is None:  # tracking failed entirely this frame
        old_gray = frame_gray.copy()
        p0 = None
        continue

    good_new = p1[st == 1]
    good_old = p0[st == 1]

    for new, old in zip(good_new, good_old):
        a, b = new.ravel().astype(int)
        c, d = old.ravel().astype(int)

        motion = np.sqrt((a - c)**2 + (b - d)**2)

        if motion > 5:
            color = (0, 0, 255)     # red = close
            label = f"CLOSE ({motion:.1f}px)"
        elif motion > 2:
            color = (0, 165, 255)   # orange = medium
            label = f"MED ({motion:.1f}px)"
        else:
            color = (0, 255, 0)     # green = far
            label = f"FAR ({motion:.1f}px)"

        cv2.circle(frame, (a, b), 5, color, -1)
        cv2.putText(frame, label, (a+5, b),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.4, color, 1)

    cv2.imshow("Motion Parallax (RED=close, GREEN=far)", frame)

    old_gray = frame_gray.copy()
    p0 = good_new.reshape(-1, 1, 2)

    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

cap.release()
cv2.destroyAllWindows()

Move your hand toward the camera — tracked points turn red. Background stays green.

Experiment 3: Object Detection + Size-Based Distance

This replicates Tesla’s “known object size → distance” method:

import cv2
import numpy as np
from ultralytics import YOLO

model = YOLO("yolov8n.pt")  # downloads automatically

# Known real-world widths in meters
KNOWN_WIDTHS = {
    "person": 0.5, "car": 2.0, "truck": 2.5, "bicycle": 0.6,
    "bottle": 0.07, "cell phone": 0.07, "laptop": 0.35, "chair": 0.5,
}

FOCAL_LENGTH = 700  # pixels — calibrate for your camera

cap = cv2.VideoCapture(0)

while True:
    ret, frame = cap.read()
    if not ret:
        break

    results = model(frame, verbose=False)[0]

    for box in results.boxes:
        cls_name = model.names[int(box.cls)]
        x1, y1, x2, y2 = box.xyxy[0].int().tolist()
        confidence = float(box.conf)

        if cls_name in KNOWN_WIDTHS and confidence > 0.5:
            pixel_width = x2 - x1
            real_width = KNOWN_WIDTHS[cls_name]

            # Distance = (real_width * focal_length) / pixel_width
            distance = (real_width * FOCAL_LENGTH) / pixel_width

            label = f"{cls_name}: {distance:.1f}m"
            color = (0, 0, 255) if distance < 1.5 else (0, 255, 0)

            cv2.rectangle(frame, (x1, y1), (x2, y2), color, 2)
            cv2.putText(frame, label, (x1, y1 - 10),
                        cv2.FONT_HERSHEY_SIMPLEX, 0.6, color, 2)

    cv2.imshow("Object Distance Estimation", frame)

    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

cap.release()
cv2.destroyAllWindows()

Hold a bottle or phone at known distances (1m, 2m) and compare the estimate. To calibrate your focal length:

known_distance = 1.0     # meters
known_width_real = 0.07  # meters (e.g. a phone)
known_width_pixels = 55  # measure from your output frame

focal_length = (known_width_pixels * known_distance) / known_width_real
print(f"Your focal length: {focal_length:.1f} pixels")

Experiment 4: Combined System (Mini Tesla-Style)

Depth map + object detection running together:

import cv2
import torch
import numpy as np
from ultralytics import YOLO

midas = torch.hub.load("intel-isl/MiDaS", "MiDaS_small")
midas.eval()
transform = torch.hub.load("intel-isl/MiDaS", "transforms").small_transform
yolo = YOLO("yolov8n.pt")

cap = cv2.VideoCapture(0)

while True:
    ret, frame = cap.read()
    if not ret:
        break

    # Depth map
    img_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    with torch.no_grad():
        depth = midas(transform(img_rgb))
        depth = torch.nn.functional.interpolate(
            depth.unsqueeze(1), size=frame.shape[:2],
            mode="bicubic", align_corners=False).squeeze().numpy()

    depth_norm = cv2.normalize(depth, None, 0, 255,
                               cv2.NORM_MINMAX, cv2.CV_8U)
    depth_colored = cv2.applyColorMap(depth_norm, cv2.COLORMAP_MAGMA)

    # Object detection
    results = yolo(frame, verbose=False)[0]
    for box in results.boxes:
        x1, y1, x2, y2 = box.xyxy[0].int().tolist()
        cls_name = yolo.names[int(box.cls)]

        # Sample depth at center of bounding box
        cx, cy = (x1 + x2) // 2, (y1 + y2) // 2
        relative_depth = depth[cy, cx]

        label = f"{cls_name} d={relative_depth:.0f}"
        cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 255, 0), 2)
        cv2.putText(frame, label, (x1, y1 - 8),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 2)

    combined = np.hstack([frame, depth_colored])
    cv2.imshow("Detection + Depth (Tesla-style)", combined)

    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

cap.release()
cv2.destroyAllWindows()

| Experiment | What It Teaches | Tesla Equivalent |
|---|---|---|
| 1 — MiDaS | Neural nets estimate depth from one image | Occupancy network concept |
| 2 — Optical Flow | Motion between frames reveals depth | Sequential frame analysis |
| 3 — YOLO + Size | Known object sizes give absolute distance | Object-size prior |
| 4 — Combined | Fuse estimates into one system | How Tesla fuses all methods |

Part 5: PID Control — The 80-Year-Old Algorithm Running 95% of Robots

PID stands for Proportional, Integral, Derivative — three terms that work together to minimize the error between where you are and where you want to be.

The Core Idea

You want to be at position 100
You are currently at position 70
Error = 100 - 70 = 30

P → How big is the error RIGHT NOW?
I → Has there been a persistent error OVER TIME?
D → Is the error GROWING or SHRINKING?

Real-World Analogy: Maintaining Speed

Current speed: 45 mph    Target: 60 mph    Error: 15 mph

P (Proportional):  Big error → press gas hard
I (Integral):      Been slow for a while → press even more
D (Derivative):    Speed increasing fast → ease off slightly
                   (prevent overshooting to 75 mph)

Each Term Explained

P — Proportional (react to current error):

Output = Kp × error

Error=30, Kp=0.5  →  Output=15 (strong correction)
Error=10, Kp=0.5  →  Output=5  (medium correction)
Error=2,  Kp=0.5  →  Output=1  (small correction)

Problem: if the system needs sustained output just to hold position (gravity, friction, a constant disturbance), P alone settles with a small remaining error (the steady-state error), because zero error would mean zero output. It needs help.
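You can watch the steady-state error appear in a few lines. A P-only controller fighting a constant disturbance settles exactly where Kp × error cancels the disturbance, which forces a nonzero error (this toy first-order system and its numbers are invented for illustration):

```python
# P-only control vs. a constant disturbance: the system settles where
# Kp * error exactly cancels the disturbance, i.e. at a nonzero error.
setpoint, kp, disturbance = 100.0, 0.5, 5.0
value = 0.0
for _ in range(200):
    error = setpoint - value
    output = kp * error - disturbance  # disturbance constantly pushes back
    value += output * 0.1              # simple first-order response

# Settles near 90, not 100: steady-state error = disturbance / Kp = 10
print(round(value, 1))  # 90.0
```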

I — Integral (react to accumulated error over time):

Output = Ki × Σ(error × Δt)   (the error accumulated over time)

Time:    0   1   2   3   4   5  seconds
Error:  10  10  10  10  10  10  (stuck 10° below target)

Integral keeps growing:
  t=1: 10
  t=2: 20
  t=3: 30  ← output keeps increasing until error is gone

Without I, a heater stuck just below its target sits at 99°C forever. With I, the integral grows, pushes harder, and eventually reaches 100°C exactly. But if Ki is too large, the integral winds up, and the system overshoots and oscillates wildly.
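The standard fix for a runaway integral is anti-windup: clamp the accumulated term so its contribution can never exceed what the actuator can deliver. A minimal sketch of the clamp (the ±50 output limits are arbitrary):

```python
def update_integral(integral: float, error: float, dt: float, ki: float,
                    out_min: float = -50.0, out_max: float = 50.0) -> float:
    """Accumulate error, but clamp so ki * integral stays within actuator limits."""
    integral += error * dt
    # Anti-windup: keep the integral's contribution inside [out_min, out_max]
    return max(out_min / ki, min(out_max / ki, integral))

# A huge persistent error saturates at the clamp instead of growing forever
i = 0.0
for _ in range(1000):
    i = update_integral(i, error=100.0, dt=0.05, ki=0.5)
print(i * 0.5)  # 50.0 — capped at the actuator limit
```

Without the clamp, 1000 steps of this error would leave the integral demanding an output of 2500: the system would then blast past the setpoint for a long time while the integral unwinds.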

D — Derivative (react to rate of change):

Output = Kd × (change in error / change in time)

Error shrinking fast:  derivative negative → reduce output (don’t overshoot)
Error growing fast:    derivative positive → increase output urgently
Error stable:          derivative zero     → no D contribution

D is anticipation — it predicts where the system is heading and applies braking force for a smooth arrival.

The Full PID Formula

Output = Kp×error + Ki×∫error dt + Kd×(d error/dt)

Visual Intuition

Well-tuned PID:
Target ────────────────────  100
                 _____/
            ____/
  _________/
Actual

Too little D (oscillates):
Target ────────────────────  100
         /\        /\
        /  \      /  \      /\
─────/    \    /    \    /  \______
             \  /      \  /
              \/        \/

Kp too small (never reaches):
Target ────────────────────  100
                         ________  95
          ______________/
─────────

Full Simulation Code

import time
import matplotlib.pyplot as plt

class PID:
    def __init__(self, kp, ki, kd, setpoint):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.setpoint = setpoint
        self.prev_error = 0
        self.integral = 0
        self.last_time = time.time()

    def compute(self, actual_value):
        current_time = time.time()
        dt = current_time - self.last_time
        if dt <= 0:
            dt = 0.001

        error = self.setpoint - actual_value

        P = self.kp * error
        self.integral += error * dt
        I = self.ki * self.integral
        derivative = (error - self.prev_error) / dt
        D = self.kd * derivative

        output = P + I + D
        self.prev_error = error
        self.last_time = current_time

        return output, P, I, D, error


class SimulatedSystem:
    def __init__(self, initial_value=0):
        self.value = initial_value
        self.velocity = 0

    def update(self, control_output, dt=0.05):
        self.velocity += control_output * dt * 0.5
        self.velocity *= 0.95  # friction/damping
        self.value += self.velocity * dt
        return self.value


# Run simulation
setpoint = 100.0
pid = PID(kp=2.0, ki=0.5, kd=1.0, setpoint=setpoint)
system = SimulatedSystem(initial_value=0)

times, actuals, targets = [], [], []
p_terms, i_terms, d_terms = [], [], []

t = 0
for step in range(100):
    actual = system.value
    output, P, I, D, error = pid.compute(actual)
    output = max(-50, min(50, output))  # clamp like real actuators
    system.update(output)

    times.append(t)
    actuals.append(actual)
    targets.append(setpoint)
    p_terms.append(P)
    i_terms.append(I)
    d_terms.append(D)

    t += 0.05
    time.sleep(0.01)

# Plot results
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(10, 8))

ax1.plot(times, actuals, 'b-', label='Actual', linewidth=2)
ax1.plot(times, targets, 'r--', label='Target', linewidth=2)
ax1.set_ylabel('Value')
ax1.set_title('PID Control — System Response')
ax1.legend()
ax1.grid(True)

ax2.plot(times, p_terms, 'r-', label='P term', linewidth=1.5)
ax2.plot(times, i_terms, 'g-', label='I term', linewidth=1.5)
ax2.plot(times, d_terms, 'b-', label='D term', linewidth=1.5)
ax2.set_xlabel('Time (seconds)')
ax2.set_ylabel('Contribution')
ax2.set_title('PID — Individual Term Contributions')
ax2.legend()
ax2.grid(True)

plt.tight_layout()
plt.savefig('pid_response.png', dpi=150)
plt.show()

Try changing the gains and re-running:

| Configuration | Kp | Ki | Kd | What Happens |
|---|---|---|---|---|
| Well tuned | 2.0 | 0.5 | 1.0 | Smooth response |
| Aggressive P | 8.0 | 0.5 | 0.1 | Oscillates |
| No D | 2.0 | 0.5 | 0.0 | Overshoots |
| Too much I | 2.0 | 5.0 | 1.0 | Slow windup, overshoots |
| P only | 1.0 | 0.0 | 0.0 | Steady state error |
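If guessing gains feels unprincipled, the classic Ziegler–Nichols recipe gives starting values from two measurements: the ultimate gain Ku (the Kp at which a P-only loop oscillates steadily) and the oscillation period Tu. A sketch of the classic rule (the Ku and Tu values below are hypothetical):

```python
def ziegler_nichols_pid(ku: float, tu: float):
    """Classic Ziegler-Nichols rule: Kp = 0.6*Ku, Ti = Tu/2, Td = Tu/8."""
    kp = 0.6 * ku
    ki = kp / (tu / 2)  # integral gain from integral time Ti
    kd = kp * (tu / 8)  # derivative gain from derivative time Td
    return kp, ki, kd

# Hypothetical measurement: loop oscillates at Kp = 8.0 with a 2-second period
print(tuple(round(g, 2) for g in ziegler_nichols_pid(ku=8.0, tu=2.0)))  # (4.8, 4.8, 1.2)
```

Treat the result as a starting point: Ziegler–Nichols tends toward aggressive, overshooting responses, so you usually back off Kp and Ki from there.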

PID + Camera: Object Tracking (Tesla Lane Centering)

import cv2
from ultralytics import YOLO

class PID:
    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.prev_error = 0
        self.integral = 0

    def compute(self, error, dt=0.033):
        self.integral += error * dt
        derivative = (error - self.prev_error) / dt
        output = self.kp * error + self.ki * self.integral + self.kd * derivative
        self.prev_error = error
        return output

model = YOLO("yolov8n.pt")
pid = PID(kp=0.5, ki=0.01, kd=0.1)

cap = cv2.VideoCapture(0)
frame_width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
center_x = frame_width // 2

while True:
    ret, frame = cap.read()
    if not ret:
        break

    results = model(frame, verbose=False)[0]

    for box in results.boxes:
        if model.names[int(box.cls)] == "person":
            x1, y1, x2, y2 = box.xyxy[0].int().tolist()
            obj_center_x = (x1 + x2) // 2

            error = obj_center_x - center_x
            correction = pid.compute(error)

            cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 255, 0), 2)
            cv2.line(frame, (center_x, 0),
                     (center_x, frame.shape[0]), (255, 0, 0), 2)
            cv2.line(frame, (obj_center_x, 0),
                     (obj_center_x, frame.shape[0]), (0, 0, 255), 1)

            # Positive correction = object right of center: steer right to re-center it
            direction = "RIGHT" if correction > 0 else "LEFT"
            cv2.putText(frame, f"Error: {error}px", (10, 30),
                        cv2.FONT_HERSHEY_SIMPLEX, 0.7, (0, 255, 255), 2)
            cv2.putText(frame, f"PID: {correction:.1f}", (10, 60),
                        cv2.FONT_HERSHEY_SIMPLEX, 0.7, (0, 255, 255), 2)
            cv2.putText(frame, f"Steer: {direction}", (10, 90),
                        cv2.FONT_HERSHEY_SIMPLEX, 0.7, (0, 0, 255), 2)

    cv2.imshow("PID Object Tracking (Tesla lane centering)", frame)

    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

cap.release()
cv2.destroyAllWindows()

This simulates the same logic Tesla uses for lane centering — measure error from desired position, apply PID correction, continuously minimize error.

Where Each Term Dominates in Tesla

| Situation | Dominant Term |
|---|---|
| Car drifting from lane center | P (big error, correct now) |
| Persistent crosswind | I (keeps pushing off → I builds up) |
| Approaching sharp curve | D (error changing fast → anticipate) |
| Sudden obstacle | P + D (large error + fast change) |

Tesla’s Full Control Stack

Neural Network (FSD)         ← high-level decisions
        ↓
MPC (Model Predictive Control) ← trajectory tracking
        ↓
PID Controllers               ← actuator commands
        ↓
Steering Motor / Throttle / Brakes

In FSD v12, the neural network increasingly replaces explicit MPC and PID with learned control — the planning and control steps are implicit inside the network weights. But PID still handles the lowest-level actuator precision.

Part 6: Why Robots Need Foundation Models

Traditional Robot:           Foundation Model Robot:
  "pick red ball"       vs     "grab the thing that
  (hardcoded)                   rolls and bounces"
  breaks if ball                works on any new object
  changes color                 it has never seen before

1. Robot Policy Learning

The old way: engineers write thousands of IF/THEN rules for every scenario. The problem: infinite edge cases (wet objects slip, fragile objects break, irregular shapes have no rule). Foundation models watch human demonstrations and learn generalizable policies that work on objects never seen in training.

2. Value Learning

Instead of fixed reward functions that get hacked (robot flips the box upside down to “complete” the task), value learning assigns cost to states. Low cost: object grasped securely, smooth motion. High cost: object falling, jerky motion, near collision. The robot chooses the lowest-cost path through state space.

3. High-Level Task Planning

Traditional: pre-program every sub-step of “make coffee” — fails if the mug is in a different location or someone says “brew me a cup” instead. LLMs understand natural language, generate plans automatically, and each step gets sent to lower-level controllers.

4. LLM Code Generation

New task needed? The old way: hire an engineer, weeks of coding. With LLM code generation (e.g. GenSim), describe the task in English and get simulation environment, reward function, training scenarios, and test cases — minutes instead of months.

5. Open Vocabulary Object Detection

Traditional detectors only recognize categories seen during training. A hospital robot sees an IV drip stand and outputs UNKNOWN. Open vocabulary models (like Grounding DINO) understand any text description: “the thing that holds the fluid bag on a pole” → detects the IV drip stand correctly, despite never having seen one during training.

6. Semantic Segmentation

Object detection draws a box. But a robot needs exact pixel boundaries to know where to grip:

Bounding box:   ████████   ← robot grips the air next to the cup
                ████████

Segmentation:     ░░░░░░
                ░░██████░░
                ░░██████──  ← handle identified separately
                ░░██████░░
                  ░░░░░░

Robot knows: grip the body (not the handle, too fragile), handle is an alternative grip point, bottom is a stable placement point.

7. 3D Representation and Affordance

Affordance = what can you DO with an object. Traditional robots see a cup and know it IS a cup, but don’t know which end to drink from, where the handle is grippable, or whether it will tip. Foundation models infer affordances from shape geometry, material properties, and physical common sense learned from internet data.

How It All Connects

Human: "put the fragile glass on the high shelf carefully"
        ↓
  LLM Task Planning
  "fragile" → gentle grip    "high shelf" → extend arm    "carefully" → slow
        ↓
  Open Vocab Detection
  Finds "glass" in scene — never seen this exact glass before
        ↓
  Semantic Segmentation
  Exact shape: rim, body, base locations identified
        ↓
  3D Affordance
  Best grip point, force needed, keep upright
        ↓
  Value Learning
  Lowest cost path = smooth, slow arc upward
        ↓
  Diffusion Policy
  Smooth grasping motion learned from human demos
        ↓
  PID / MPC
  Precise actuator control, maintains grip force
        ↓
  Glass placed safely on shelf

Summary

| Problem | Solution |
|---|---|
| Can’t write rules for every scenario | Policy learning from demonstrations |
| Fixed reward functions get hacked | Value learning |
| Can’t pre-program every task | LLM task planning + code generation |
| Can’t recognize every object | Open vocabulary detection |
| Bounding boxes aren’t precise enough | Semantic segmentation |
| Don’t know how to interact with objects | 3D affordance models |

Part 7: RAG for Robots — How Semantic Memory Enables Navigation

Traditional RAG:           Robotics RAG:
  Documents → vectors       Camera frames → vectors
  "find relevant text"       "find relevant memory"
  answer questions           navigate and act in world

Everything a robot sees becomes a “document” stored in vector space.

How Robot Memory Gets Built

t=0s:   Camera captures kitchen
        VLM captions: "kitchen counter, chips bag on left shelf,
                       microwave on right, blue mug near sink"
        → embed caption → store vector + timestamp + location

t=30s:  Camera captures window area
        VLM captions: "large window, green plants, city view"
        → embed → store

t=60s:  Camera captures conference room
        VLM captions: "meeting table, whiteboard, projector"
        → embed → store

... robot builds semantic map of entire environment

How RAG Retrieval Works for Navigation

Human says: "get me the chips"
        ↓
  Speech → Text (Whisper)
        ↓
  Embed the query: "chips" → [0.23, 0.87, 0.12, ...]
        ↓
  Vector similarity search:
    Score 0.94: "kitchen counter, chips bag on left shelf"  ← MATCH
    Score 0.71: "break room, snack cabinet, various foods"
    Score 0.43: "office desk, papers, coffee cup"
        ↓
  Top result: kitchen, left shelf
        ↓
  Robot navigates and retrieves chips

The key: “chips” was never stored as an exact keyword. The vector embedding captures semantic meaning, so the query also finds “snack cabinet” and “Doritos on counter” — same concept, different words.
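That “same concept, different words” matching comes down to cosine similarity between embedding vectors. A toy sketch with invented 3-dimensional vectors (real sentence embeddings have hundreds of dimensions; these numbers exist only to show the geometry):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Angle-based similarity: 1.0 = same direction, near 0.0 = unrelated."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Invented toy "embeddings": food-related vectors point in a similar direction
chips  = np.array([0.9, 0.1, 0.0])
snacks = np.array([0.8, 0.3, 0.1])
window = np.array([0.0, 0.2, 0.9])

print(round(cosine_similarity(chips, snacks), 2))  # high: same concept
print(round(cosine_similarity(chips, window), 2))  # low: unrelated
```

A real embedding model learns to place “chips”, “snacks”, and “Doritos” in nearby directions from reading internet-scale text, which is exactly what makes the keyword-free retrieval above work.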

Full Architecture

Camera (30 fps) → VLM captions every 3 s
Microphone → Whisper → text
GPS/Odometry → location coordinates
        ↓
Embed → Vector Database (FAISS/Milvus)
[vec1, vec2, vec3 ... millions of memories]
        ↓
Query → embed → similarity search
Top K memories → inject into LLM context
LLM reasons: "Go to kitchen, chips on left"
        ↓
Navigation → path to kitchen
Manipulation → grasp chips bag
Deliver to human

Code: Build a Robot Memory System

import numpy as np
import faiss
from sentence_transformers import SentenceTransformer
from datetime import datetime

# pip install faiss-cpu sentence-transformers

class RobotMemory:
    def __init__(self):
        self.model = SentenceTransformer('all-MiniLM-L6-v2')
        self.dimension = 384
        self.index = faiss.IndexFlatL2(self.dimension)
        self.metadata = []

    def store_observation(self, caption, location=(0, 0)):
        """Store a robot observation as a memory"""
        vector = self.model.encode([caption])[0]
        self.index.add(np.array([vector], dtype=np.float32))
        self.metadata.append({
            'caption': caption,
            'location': location,
            'timestamp': datetime.now().isoformat(),
            'id': len(self.metadata)
        })
        print(f"Stored memory #{len(self.metadata)-1}: "
              f"'{caption}' at {location}")

    def query(self, question, top_k=3):
        """Find most relevant memories for a question"""
        query_vector = self.model.encode([question])[0]
        distances, indices = self.index.search(
            np.array([query_vector], dtype=np.float32), top_k
        )
        results = []
        for dist, idx in zip(distances[0], indices[0]):
            if idx < len(self.metadata):
                memory = self.metadata[idx].copy()
                memory['similarity_score'] = float(1 / (1 + dist))
                results.append(memory)
        return results

    def decide_action(self, query):
        """RAG: retrieve memories + reason about action"""
        memories = self.query(query, top_k=3)

        print(f"\nQuery: '{query}'")
        print("Relevant memories found:")
        for m in memories:
            print(f"  [{m['similarity_score']:.2f}] "
                  f"'{m['caption']}' at {m['location']}")

        best = memories[0]
        print(f"\nAction: Navigate to {best['location']}")
        print(f"Reason: '{best['caption']}' found there")
        return best['location']


# Simulate robot exploring environment
robot = RobotMemory()

robot.store_observation(
    "kitchen counter with chips bag on left shelf", location=(10, 5))
robot.store_observation(
    "large window with green plants and city view", location=(25, 3))
robot.store_observation(
    "conference room with whiteboard and projector", location=(15, 20))
robot.store_observation(
    "break room with coffee machine and mugs", location=(8, 15))
robot.store_observation(
    "storage cabinet with snacks and beverages", location=(12, 6))
robot.store_observation(
    "office desk with computer monitor and papers", location=(20, 10))

print("\n" + "=" * 50)
print("ROBOT MEMORY SYSTEM READY")
print("=" * 50)

robot.decide_action("get me something to eat")
robot.decide_action("I want to see something beautiful outside")
robot.decide_action("where can I make coffee")
robot.decide_action("find me a place to have a meeting")

Output (exact similarity scores will vary with the embedding model):

Query: 'get me something to eat'
Relevant memories found:
  [0.82] 'kitchen counter with chips bag on left shelf' at (10, 5)
  [0.79] 'storage cabinet with snacks and beverages' at (12, 6)
  [0.61] 'break room with coffee machine and mugs' at (8, 15)

Action: Navigate to (10, 5)
Reason: 'kitchen counter with chips bag on left shelf' found there

Why RAG Beats Traditional Navigation

| Traditional Robot Navigation | RAG-Based Robot Memory |
|---|---|
| Needs explicit map: “chips at (10,5)” | Builds map automatically from observations |
| Breaks if object moves | Updates when robot re-explores |
| Can’t understand “something to eat” | Understands any natural language query |
| New objects require reprogramming | Generalizes to new objects and queries |
| No memory across sessions | Memory persists indefinitely |

This is the kind of capability NVIDIA has demonstrated — a robot that remembers what it has seen and navigates based on semantic memory, not hardcoded maps.

