The Complete Beginner’s Guide to Robotics — From Your First Camera to Foundation Models
10th March 2026
This is everything you need to go from zero to building, sensing, and coding robots at home. We’ll start with what hardware to buy, walk through how Tesla and Waymo actually see the road, write real depth estimation and PID control code you can run with just a camera, and end with how foundation models and RAG are transforming what robots can do.
Part 1: Building Your Home Robotics Lab
If you have a camera, you already have the most useful sensor in robotics. Here’s what else you need.
Controllers
| Controller | Price | Best For | Key Feature |
|---|---|---|---|
| Raspberry Pi 4/5 | $35–80 | All-around beginner | Runs Linux, ROS 2, GPIO pins |
| NVIDIA Jetson Nano/Orin Nano | $100–200 | AI and vision workloads | GPU for neural networks on camera input |
| Arduino Uno/Mega | $25–45 | Actuator control | No OS, pairs well with Raspberry Pi |
Actuators
| Type | Price | Use Case |
|---|---|---|
| SG90 Micro Servo | $2–5 | Small robot arms, precise 0°–180° angles |
| MG996R Servo | $8–12 | Larger joints, more torque |
| DC Motor + L298N Driver | ~$5 | Wheels and drivetrain |
| NEMA 17 Stepper | $10–15 | Precise position control (3D printer style) |
Starter Kits
| Kit | Price | What You Get |
|---|---|---|
| Elegoo Smart Robot Car | ~$50 | Wheeled robot with sensors, motors, controller |
| Freenove Big Hexapod Robot | ~$80 | 6-legged walking robot for Raspberry Pi |
What You Can Do Right Now With Just a Camera
Camera → Raspberry Pi / your laptop → ROS 2
# Install the ROS 2 camera driver and test your camera (assumes ROS 2 Humble is already installed)
sudo apt install ros-humble-usb-cam
ros2 run usb_cam usb_cam_node_exe
# View the stream
ros2 run rqt_image_view rqt_image_view
Even before hardware arrives, you can run object detection (YOLOv8), pose estimation, AprilTag tracking, and depth estimation — all from a laptop camera.
Recommended Learning Path
- Camera + laptop — Learn ROS 2, computer vision, object detection
- Add Raspberry Pi (~$50 total) — Learn GPIO, Linux, hardware communication
- Add servos + motors (~$30 more) — Build a robot that reacts to what it sees
Part 2: Sensors — How Robots See the World
Camera Types
| Type | How It Works | Best For |
|---|---|---|
| Monocular | Single lens, 2D images | Basic detection, image recognition, simple navigation |
| Stereo | Two lenses, mimics human binocular vision | Depth perception, 3D mapping, obstacle avoidance |
| RGB-D | RGB color + depth sensing combined | Advanced navigation, object manipulation (e.g. Intel RealSense) |
| Thermal | Detects infrared radiation / temperature | Dark environments, smoke, fog, search and rescue |
| Event-Based | High temporal resolution, low latency | High-speed motion detection and tracking |
LIDAR Types
| Type | Characteristics | Use Case |
|---|---|---|
| 2D LIDAR | Simpler, cheaper, faster | Basic planar navigation, obstacle detection |
| 3D LIDAR | Detailed spatial info, wide field of view | Autonomous vehicle mapping |
| Solid-State / Flash / MEMS | Durable, cost-effective | Production vehicles, consumer robotics |
Other Common Sensors
| Sensor | What It Does |
|---|---|
| Ultrasonic | Sound waves measure distance (used in vacuum robots) |
| RADAR | Distance, speed, and size — reliable in all conditions |
| GPS | Outdoor localization |
| IMU | Acceleration and rotation tracking |
| Wheel Encoders | Measure wheel rotation for distance traveled |
| Magnetic / Compass | Earth’s magnetic field for heading |
| Touch Sensors | Contact detection (household robots) |
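To make one of these rows concrete: wheel-encoder odometry is simple arithmetic — ticks become revolutions, revolutions become distance. A minimal sketch, assuming a 360-tick encoder and a 65 mm wheel (both numbers are illustrative):

```python
import math

def odometry_distance(ticks, ticks_per_rev=360, wheel_diameter_m=0.065):
    """Convert encoder ticks to distance traveled by one wheel."""
    revolutions = ticks / ticks_per_rev
    return revolutions * math.pi * wheel_diameter_m

# 720 ticks on a 360-tick encoder = 2 full revolutions
d = odometry_distance(720)
print(f"Distance: {d:.3f} m")  # ~0.408 m for a 65 mm wheel
```

On a real robot you would fuse this with IMU data, because wheels slip and the encoder count drifts from true distance.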
Part 3: How Tesla and Waymo Actually See the Road
Tesla FSD — Vision Only
Tesla uses 8 monocular cameras arranged for 360° coverage:
[Forward Narrow]
[Forward Main]
[Forward Wide]
[Pillar Left]   [Pillar Right]
[Repeater Left] [Repeater Right]
[Rear]
| Camera | Resolution | Range | Purpose |
|---|---|---|---|
| Forward Main | 1.2 MP | 150m | Primary forward vision |
| Forward Narrow | 1.2 MP | 250m | Long range, highway reading |
| Forward Wide | 1.2 MP | 60m | Close range, intersections |
| Pillar (x2) | 1.2 MP | 100m | Traffic lights, crossing objects |
| Repeater (x2) | 1.2 MP | 100m | Lane changes, blind spots |
| Rear | 1.2 MP | 50m | Reversing, rear traffic |
All cameras run at 36 fps, processed by Tesla’s custom FSD Computer (144 TOPS). Tesla removed RADAR in 2021 (controversial), then reintroduced 4D imaging radar for some models.
Waymo — Full Multi-Modal Sensor Suite
| Sensor | Specs | Purpose |
|---|---|---|
| Roof LIDAR | Custom spinning, 360°, ~200m | Primary 3D scene understanding |
| Short-Range LIDAR | 4x solid-state (corners) | Close obstacle detection |
| Cameras | 29 total | 360° vision, color, texture |
| RADAR | 6x units | Speed measurement, all-weather reliability |
Head-to-Head Comparison
| Dimension | Tesla FSD | Waymo |
|---|---|---|
| Sensor approach | Camera only (8 cameras) | Camera + LIDAR + RADAR (40+ sensors) |
| Depth sensing | Inferred via neural net | Direct via LIDAR point clouds |
| Architecture | End-to-end neural network (FSD v12) | Modular (separate perception/planning) |
| HD maps | No — builds understanding on the fly | Yes — pre-mapped areas only |
| Rain/Fog | Camera degrades | LIDAR + RADAR more robust |
| Cost per vehicle | ~$1,000 sensor suite | ~$100,000+ sensor suite |
| Fleet | Millions of consumer vehicles | Small dedicated test fleet |
| Geography | Anywhere (no pre-mapping) | Only pre-mapped areas |
How Tesla Estimates Depth Without LIDAR
Tesla uses four methods simultaneously to figure out how far away the car in front is:
Method 1: Cross-Camera Geometry. Tesla’s 8 cameras are physically separated on the car. Same object seen from slightly different positions gives triangulation — synthetic stereo vision with a wide baseline.
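The geometry behind this is classic stereo triangulation: depth = focal length × baseline / disparity. A toy calculation (the numbers are illustrative, not Tesla's actual calibration):

```python
def stereo_depth(focal_px, baseline_m, disparity_px):
    """Depth from two views: Z = f * B / d."""
    return focal_px * baseline_m / disparity_px

# Two cameras 1.2 m apart, 1000 px focal length,
# and an object that shifts 40 px between the two views:
z = stereo_depth(focal_px=1000, baseline_m=1.2, disparity_px=40)
print(f"Estimated depth: {z:.1f} m")  # 30.0 m
```

Notice the sensitivity: halve the disparity (a more distant or lower-resolution object) and the depth estimate doubles, which is why wide baselines help at long range.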
Method 2: Monocular Depth Cues. Even from one camera, the neural network learns cues humans also use:
| Visual Cue | What It Tells the Network |
|---|---|
| Object size | Small car = far away |
| Lane convergence | Vanishing point = distance scale |
| Occlusion | Car hidden behind another = closer |
| Texture density | Road markings getting smaller = distance |
| Shadow position | Height above ground |
| Position in frame | Lower in frame = closer |
Method 3: Motion Parallax. As Tesla moves, nearby objects shift fast across frames while distant objects barely move. Speed of pixel movement + known car speed = distance. This is Structure from Motion (SfM).
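In the simplified case of a camera translating sideways, pixel motion converts directly to depth: a feature that shifts u pixels between frames while the camera moves v·Δt meters sits at roughly Z = f·v·Δt/u. A sketch with made-up numbers:

```python
def parallax_depth(focal_px, camera_speed_mps, dt_s, pixel_shift_px):
    """For a camera translating parallel to the image plane,
    optical flow u = f * v * dt / Z, so Z = f * v * dt / u."""
    return focal_px * camera_speed_mps * dt_s / pixel_shift_px

# Camera moving 10 m/s laterally, 1/36 s between frames (36 fps),
# a tracked feature shifts 14 px:
z = parallax_depth(focal_px=1000, camera_speed_mps=10.0,
                   dt_s=1 / 36, pixel_shift_px=14)
print(f"{z:.1f} m")  # ~19.8 m
```

Forward motion (the usual driving case) produces flow that also depends on where the feature sits in the image, which is why real SfM pipelines solve for geometry rather than using this one-line formula.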
Method 4: Occupancy Network. Tesla’s key innovation (FSD v10+). Eight camera feeds go into a neural network that outputs a 3D voxel grid of occupied vs. free space:
8 camera feeds (2D images)
↓
Neural Network
↓
3D Occupancy Grid
■ = occupied (car, person, wall)
□ = free space (drivable road)
□□□□□□□□□□□□□□
□□□■■■□□■■■□□□ ← two cars in 3D grid
□□□■■■□□■■■□□□
□□□□□□□□□□□□□□
Tesla → ←15m→ ←30m→
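The same idea can be mocked up in a few lines of NumPy — a toy 2D grid matching the sketch above, not Tesla's actual voxel format:

```python
import numpy as np

# 2D occupancy grid: 0 = free, 1 = occupied; columns = lateral position
grid = np.zeros((4, 14), dtype=np.uint8)
grid[1:3, 3:6] = 1    # first car (~15 m ahead)
grid[1:3, 8:11] = 1   # second car (~30 m ahead)

def is_drivable(grid, row, col):
    """The free-space query a planner would make against the grid."""
    return bool(grid[row, col] == 0)

print(is_drivable(grid, 0, 0))  # True  (open road)
print(is_drivable(grid, 1, 4))  # False (inside the first car)
```

The planner never needs to know *what* occupies a cell — "occupied" is enough to route around it, which is what makes the representation robust to objects the network has never classified.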
Why this is controversial: a LIDAR return is a direct physical measurement (time of flight at the speed of light), while camera + neural net produces a statistical estimate that can be wrong in rare cases. The 2016 fatal Autopilot crash involved a white truck trailer against a bright sky — the poor contrast meant the system failed to register the trailer as an obstacle. LIDAR would have detected that truck regardless of color or lighting.
Tesla’s Fleet Learning Feedback Loop
Real car encounters edge case
↓
Data uploaded (driver intervention flagged)
↓
Added to training set
↓
Model retrained
↓
OTA update pushed to ~5 million Tesla vehicles
↓
All cars now handle that edge case better
No other autonomy company has this scale of real-world feedback.
Part 4: Test It Yourself — Depth Estimation With Your Camera
pip install opencv-python torch torchvision numpy timm ultralytics
Experiment 1: Monocular Depth (MiDaS)
Same concept Tesla uses — a neural network estimates depth from a single camera frame:
import cv2
import torch
import numpy as np
# Load MiDaS model (same type of approach Tesla uses)
model_type = "MiDaS_small" # fast, runs on CPU
midas = torch.hub.load("intel-isl/MiDaS", model_type)
midas.eval()
transforms = torch.hub.load("intel-isl/MiDaS", "transforms")
transform = transforms.small_transform
cap = cv2.VideoCapture(0) # your camera
while True:
ret, frame = cap.read()
if not ret:
break
img_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
input_tensor = transform(img_rgb)
with torch.no_grad():
depth = midas(input_tensor)
depth = torch.nn.functional.interpolate(
depth.unsqueeze(1),
size=frame.shape[:2],
mode="bicubic",
align_corners=False,
).squeeze()
depth_map = depth.numpy()
depth_normalized = cv2.normalize(
depth_map, None, 0, 255, cv2.NORM_MINMAX, cv2.CV_8U
)
depth_colored = cv2.applyColorMap(depth_normalized, cv2.COLORMAP_MAGMA)
combined = np.hstack([frame, depth_colored])
cv2.imshow("Camera | Depth Map (bright=close, dark=far)", combined)
if cv2.waitKey(1) & 0xFF == ord('q'):
break
cap.release()
cv2.destroyAllWindows()
Wave your hand toward the camera and watch the depth map change in real time.
Experiment 2: Motion Parallax — Depth From Movement
This replicates Tesla’s sequential frame depth method using optical flow:
import cv2
import numpy as np
cap = cv2.VideoCapture(0)
feature_params = dict(maxCorners=100, qualityLevel=0.3,
minDistance=7, blockSize=7)
lk_params = dict(winSize=(15, 15), maxLevel=2,
criteria=(cv2.TERM_CRITERIA_EPS |
cv2.TERM_CRITERIA_COUNT, 10, 0.03))
ret, old_frame = cap.read()
old_gray = cv2.cvtColor(old_frame, cv2.COLOR_BGR2GRAY)
p0 = cv2.goodFeaturesToTrack(old_gray, mask=None, **feature_params)
while True:
    ret, frame = cap.read()
    if not ret:
        break
    frame_gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    p1, st, err = cv2.calcOpticalFlowPyrLK(
        old_gray, frame_gray, p0, None, **lk_params
    )
    if p1 is None:  # tracking lost every feature — re-detect and continue
        old_gray = frame_gray.copy()
        p0 = cv2.goodFeaturesToTrack(old_gray, mask=None, **feature_params)
        continue
    good_new = p1[st == 1]
    good_old = p0[st == 1]
for new, old in zip(good_new, good_old):
a, b = new.ravel().astype(int)
c, d = old.ravel().astype(int)
motion = np.sqrt((a - c)**2 + (b - d)**2)
if motion > 5:
color = (0, 0, 255) # red = close
label = f"CLOSE ({motion:.1f}px)"
elif motion > 2:
color = (0, 165, 255) # orange = medium
label = f"MED ({motion:.1f}px)"
else:
color = (0, 255, 0) # green = far
label = f"FAR ({motion:.1f}px)"
cv2.circle(frame, (a, b), 5, color, -1)
cv2.putText(frame, label, (a+5, b),
cv2.FONT_HERSHEY_SIMPLEX, 0.4, color, 1)
cv2.imshow("Motion Parallax (RED=close, GREEN=far)", frame)
    old_gray = frame_gray.copy()
    p0 = good_new.reshape(-1, 1, 2)
    if len(p0) == 0:  # all features lost — re-seed before the next iteration
        p0 = cv2.goodFeaturesToTrack(old_gray, mask=None, **feature_params)
if cv2.waitKey(1) & 0xFF == ord('q'):
break
cap.release()
cv2.destroyAllWindows()
Move your hand toward the camera — tracked points turn red. Background stays green.
Experiment 3: Object Detection + Size-Based Distance
This replicates Tesla’s “known object size → distance” method:
import cv2
import numpy as np
from ultralytics import YOLO
model = YOLO("yolov8n.pt") # downloads automatically
# Known real-world widths in meters
KNOWN_WIDTHS = {
"person": 0.5, "car": 2.0, "truck": 2.5, "bicycle": 0.6,
"bottle": 0.07, "cell phone": 0.07, "laptop": 0.35, "chair": 0.5,
}
FOCAL_LENGTH = 700 # pixels — calibrate for your camera
cap = cv2.VideoCapture(0)
while True:
ret, frame = cap.read()
if not ret:
break
results = model(frame, verbose=False)[0]
for box in results.boxes:
cls_name = model.names[int(box.cls)]
x1, y1, x2, y2 = box.xyxy[0].int().tolist()
confidence = float(box.conf)
if cls_name in KNOWN_WIDTHS and confidence > 0.5:
pixel_width = x2 - x1
real_width = KNOWN_WIDTHS[cls_name]
# Distance = (real_width * focal_length) / pixel_width
distance = (real_width * FOCAL_LENGTH) / pixel_width
label = f"{cls_name}: {distance:.1f}m"
color = (0, 0, 255) if distance < 1.5 else (0, 255, 0)
cv2.rectangle(frame, (x1, y1), (x2, y2), color, 2)
cv2.putText(frame, label, (x1, y1 - 10),
cv2.FONT_HERSHEY_SIMPLEX, 0.6, color, 2)
cv2.imshow("Object Distance Estimation", frame)
if cv2.waitKey(1) & 0xFF == ord('q'):
break
cap.release()
cv2.destroyAllWindows()
Hold a bottle or phone at known distances (1m, 2m) and compare the estimate. To calibrate your focal length:
known_distance = 1.0 # meters
known_width_real = 0.07 # meters (e.g. a phone)
known_width_pixels = 55 # measure from your output frame
focal_length = (known_width_pixels * known_distance) / known_width_real
print(f"Your focal length: {focal_length:.1f} pixels")
Experiment 4: Combined System (Mini Tesla-Style)
Depth map + object detection running together:
import cv2
import torch
import numpy as np
from ultralytics import YOLO
midas = torch.hub.load("intel-isl/MiDaS", "MiDaS_small")
midas.eval()
transform = torch.hub.load("intel-isl/MiDaS", "transforms").small_transform
yolo = YOLO("yolov8n.pt")
cap = cv2.VideoCapture(0)
while True:
ret, frame = cap.read()
if not ret:
break
# Depth map
img_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
with torch.no_grad():
depth = midas(transform(img_rgb))
depth = torch.nn.functional.interpolate(
depth.unsqueeze(1), size=frame.shape[:2],
mode="bicubic", align_corners=False).squeeze().numpy()
depth_norm = cv2.normalize(depth, None, 0, 255,
cv2.NORM_MINMAX, cv2.CV_8U)
depth_colored = cv2.applyColorMap(depth_norm, cv2.COLORMAP_MAGMA)
# Object detection
results = yolo(frame, verbose=False)[0]
for box in results.boxes:
x1, y1, x2, y2 = box.xyxy[0].int().tolist()
cls_name = yolo.names[int(box.cls)]
# Sample depth at center of bounding box
cx, cy = (x1 + x2) // 2, (y1 + y2) // 2
relative_depth = depth[cy, cx]
label = f"{cls_name} d={relative_depth:.0f}"
cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 255, 0), 2)
cv2.putText(frame, label, (x1, y1 - 8),
cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 2)
combined = np.hstack([frame, depth_colored])
cv2.imshow("Detection + Depth (Tesla-style)", combined)
if cv2.waitKey(1) & 0xFF == ord('q'):
break
cap.release()
cv2.destroyAllWindows()
| Experiment | What It Teaches | Tesla Equivalent |
|---|---|---|
| 1 — MiDaS | Neural nets estimate depth from one image | Occupancy network concept |
| 2 — Optical Flow | Motion between frames reveals depth | Sequential frame analysis |
| 3 — YOLO + Size | Known object sizes give absolute distance | Object-size prior |
| 4 — Combined | Fuse estimates into one system | How Tesla fuses all methods |
Part 5: PID Control — The Century-Old Algorithm Behind ~95% of Control Loops
PID stands for Proportional, Integral, Derivative — three terms that work together to minimize the error between where you are and where you want to be.
The Core Idea
You want to be at position 100
You are currently at position 70
Error = 100 - 70 = 30
P → How big is the error RIGHT NOW?
I → Has there been a persistent error OVER TIME?
D → Is the error GROWING or SHRINKING?
Real-World Analogy: Maintaining Speed
Current speed: 45 mph Target: 60 mph Error: 15 mph
P (Proportional): Big error → press gas hard
I (Integral): Been slow for a while → press even more
D (Derivative): Speed increasing fast → ease off slightly
(prevent overshooting to 75 mph)
Each Term Explained
P — Proportional (react to current error):
Output = Kp × error
Error=30, Kp=0.5 → Output=15 (strong correction)
Error=10, Kp=0.5 → Output=5 (medium correction)
Error=2, Kp=0.5 → Output=1 (small correction)
Problem: P alone always leaves a small residual error (steady-state error) — as the error shrinks, the correction shrinks with it and stalls before reaching zero. It needs help.
I — Integral (react to accumulated error over time):
Output = Ki × Σ (error × dt) — the running sum of error accumulated over time
Time: 0 1 2 3 4 5 seconds
Error: 10 10 10 10 10 10 (stuck 10° below target)
Integral keeps growing:
t=1: 10
t=2: 20
t=3: 30 ← output keeps increasing until error is gone
In a thermostat example: without I, the temperature sits at 99°C forever. With I, the integral grows, pushes harder, and eventually reaches 100°C exactly. But if Ki is too large, the system overshoots and oscillates wildly.
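A standard guard against that runaway is clamping the integral ("anti-windup"). A minimal sketch of the idea, with an arbitrary clamp limit:

```python
def accumulate_integral(integral, error, dt, limit=50.0):
    """Add error*dt, but clamp so the I term cannot wind up unbounded."""
    integral += error * dt
    return max(-limit, min(limit, integral))

# Stuck 10 units below target for 20 seconds (dt = 1 s):
acc = 0.0
for _ in range(20):
    acc = accumulate_integral(acc, error=10, dt=1.0)
print(acc)  # capped at 50.0 instead of growing to 200
```

Without the clamp, a long period of saturated output (say, an actuator already at full power) would leave a huge integral that drives massive overshoot once the error finally closes.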
D — Derivative (react to rate of change):
Output = Kd × (change in error / change in time)
Error shrinking fast: derivative negative → reduce output (don’t overshoot)
Error growing fast: derivative positive → increase output urgently
Error stable: derivative zero → no D contribution
D is anticipation — it predicts where the system is heading and applies braking force for a smooth arrival.
The Full PID Formula
Output = Kp×error + Ki×∫error dt + Kd×(d error/dt)
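Plugging numbers through the formula once makes it concrete (gains and errors are made up for illustration):

```python
def pid_output(kp, ki, kd, error, integral, prev_error, dt):
    """One discrete PID step: returns (output, updated integral)."""
    integral += error * dt
    derivative = (error - prev_error) / dt
    return kp * error + ki * integral + kd * derivative, integral

# error = 30, previous error = 40, dt = 0.1 s, integral so far = 5
out, integral = pid_output(kp=2.0, ki=0.5, kd=1.0,
                           error=30, integral=5, prev_error=40, dt=0.1)
print(out)  # P=60, I=4, D=-100  →  total = -36.0
```

The D term dominates here because the error is shrinking fast (40 → 30 in 0.1 s), so the controller actually brakes even though the error is still large — exactly the anticipation described above.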
Visual Intuition
Well-tuned PID:
Target ──────────────────── 100
_____/
____/
_________/
Actual
Too little D (oscillates):
Target ──────────────────── 100
/\ /\
/ \ / \ /\
─────/ \ / \ / \______
\ / \ /
\/ \/
Kp too small (never reaches):
Target ──────────────────── 100
________ 95
______________/
─────────
Full Simulation Code
import time
import matplotlib.pyplot as plt
class PID:
def __init__(self, kp, ki, kd, setpoint):
self.kp, self.ki, self.kd = kp, ki, kd
self.setpoint = setpoint
self.prev_error = 0
self.integral = 0
self.last_time = time.time()
def compute(self, actual_value):
current_time = time.time()
dt = current_time - self.last_time
if dt <= 0:
dt = 0.001
error = self.setpoint - actual_value
P = self.kp * error
self.integral += error * dt
I = self.ki * self.integral
derivative = (error - self.prev_error) / dt
D = self.kd * derivative
output = P + I + D
self.prev_error = error
self.last_time = current_time
return output, P, I, D, error
class SimulatedSystem:
def __init__(self, initial_value=0):
self.value = initial_value
self.velocity = 0
def update(self, control_output, dt=0.05):
self.velocity += control_output * dt * 0.5
self.velocity *= 0.95 # friction/damping
self.value += self.velocity * dt
return self.value
# Run simulation
setpoint = 100.0
pid = PID(kp=2.0, ki=0.5, kd=1.0, setpoint=setpoint)
system = SimulatedSystem(initial_value=0)
times, actuals, targets = [], [], []
p_terms, i_terms, d_terms = [], [], []
t = 0
for step in range(100):
actual = system.value
output, P, I, D, error = pid.compute(actual)
output = max(-50, min(50, output)) # clamp like real actuators
system.update(output)
times.append(t)
actuals.append(actual)
targets.append(setpoint)
p_terms.append(P)
i_terms.append(I)
d_terms.append(D)
t += 0.05
time.sleep(0.01)
# Plot results
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(10, 8))
ax1.plot(times, actuals, 'b-', label='Actual', linewidth=2)
ax1.plot(times, targets, 'r--', label='Target', linewidth=2)
ax1.set_ylabel('Value')
ax1.set_title('PID Control — System Response')
ax1.legend()
ax1.grid(True)
ax2.plot(times, p_terms, 'r-', label='P term', linewidth=1.5)
ax2.plot(times, i_terms, 'g-', label='I term', linewidth=1.5)
ax2.plot(times, d_terms, 'b-', label='D term', linewidth=1.5)
ax2.set_xlabel('Time (seconds)')
ax2.set_ylabel('Contribution')
ax2.set_title('PID — Individual Term Contributions')
ax2.legend()
ax2.grid(True)
plt.tight_layout()
plt.savefig('pid_response.png', dpi=150)
plt.show()
Try changing the gains and re-running:
| Configuration | Kp | Ki | Kd | What Happens |
|---|---|---|---|---|
| Well tuned | 2.0 | 0.5 | 1.0 | Smooth response |
| Aggressive P | 8.0 | 0.5 | 0.1 | Oscillates |
| No D | 2.0 | 0.5 | 0.0 | Overshoots |
| Too much I | 2.0 | 5.0 | 1.0 | Slow windup, overshoots |
| P only | 1.0 | 0.0 | 0.0 | Steady state error |
PID + Camera: Object Tracking (Tesla Lane Centering)
import cv2
from ultralytics import YOLO
class PID:
def __init__(self, kp, ki, kd):
self.kp, self.ki, self.kd = kp, ki, kd
self.prev_error = 0
self.integral = 0
def compute(self, error, dt=0.033):
self.integral += error * dt
derivative = (error - self.prev_error) / dt
output = self.kp * error + self.ki * self.integral + self.kd * derivative
self.prev_error = error
return output
model = YOLO("yolov8n.pt")
pid = PID(kp=0.5, ki=0.01, kd=0.1)
cap = cv2.VideoCapture(0)
frame_width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
center_x = frame_width // 2
while True:
ret, frame = cap.read()
if not ret:
break
results = model(frame, verbose=False)[0]
for box in results.boxes:
if model.names[int(box.cls)] == "person":
x1, y1, x2, y2 = box.xyxy[0].int().tolist()
obj_center_x = (x1 + x2) // 2
error = obj_center_x - center_x
correction = pid.compute(error)
cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 255, 0), 2)
cv2.line(frame, (center_x, 0),
(center_x, frame.shape[0]), (255, 0, 0), 2)
cv2.line(frame, (obj_center_x, 0),
(obj_center_x, frame.shape[0]), (0, 0, 255), 1)
direction = "RIGHT" if correction > 0 else "LEFT"  # steer toward the target
cv2.putText(frame, f"Error: {error}px", (10, 30),
cv2.FONT_HERSHEY_SIMPLEX, 0.7, (0, 255, 255), 2)
cv2.putText(frame, f"PID: {correction:.1f}", (10, 60),
cv2.FONT_HERSHEY_SIMPLEX, 0.7, (0, 255, 255), 2)
cv2.putText(frame, f"Steer: {direction}", (10, 90),
cv2.FONT_HERSHEY_SIMPLEX, 0.7, (0, 0, 255), 2)
cv2.imshow("PID Object Tracking (Tesla lane centering)", frame)
if cv2.waitKey(1) & 0xFF == ord('q'):
break
cap.release()
cv2.destroyAllWindows()
This mirrors the basic logic of lane centering — measure the error from a desired position, apply a PID correction, and continuously minimize the error.
Where Each Term Dominates in Tesla
| Situation | Dominant Term |
|---|---|
| Car drifting from lane center | P (big error, correct now) |
| Persistent crosswind | I (keeps pushing off → I builds up) |
| Approaching sharp curve | D (error changing fast → anticipate) |
| Sudden obstacle | P + D (large error + fast change) |
Tesla’s Full Control Stack
Neural Network (FSD) ← high-level decisions
↓
MPC (Model Predictive Control) ← trajectory tracking
↓
PID Controllers ← actuator commands
↓
Steering Motor / Throttle / Brakes
In FSD v12, the neural network increasingly replaces explicit MPC and PID with learned control — the planning and control steps are implicit inside the network weights. But PID still handles the lowest-level actuator precision.
Part 6: Why Robots Need Foundation Models
Traditional Robot:                   Foundation Model Robot:
  "pick red ball" (hardcoded)          "grab the thing that rolls and bounces"
  breaks if the ball changes color     works on any new object it has never seen
1. Robot Policy Learning
The old way: engineers write thousands of IF/THEN rules for every scenario. The problem: infinite edge cases (wet objects slip, fragile objects break, irregular shapes have no rule). Foundation models watch human demonstrations and learn generalizable policies that work on objects never seen in training.
2. Value Learning
Instead of fixed reward functions that get hacked (robot flips the box upside down to “complete” the task), value learning assigns cost to states. Low cost: object grasped securely, smooth motion. High cost: object falling, jerky motion, near collision. The robot chooses the lowest-cost path through state space.
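In miniature, the selection step is just scoring candidate states and picking the cheapest. The costs below are hypothetical, purely to illustrate the mechanic:

```python
# Hypothetical costs assigned to candidate grasp outcomes
state_costs = {
    "secure_grasp_smooth_motion": 0.1,
    "secure_grasp_jerky_motion": 0.6,
    "object_slipping": 5.0,
    "near_collision": 10.0,
}

# The robot commits to the lowest-cost state
best_state = min(state_costs, key=state_costs.get)
print(best_state)  # secure_grasp_smooth_motion
```

The hard part, of course, is learning the cost function itself — the dictionary above is what a trained value model produces, not something an engineer writes by hand.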
3. High-Level Task Planning
Traditional: pre-program every sub-step of “make coffee” — fails if the mug is in a different location or someone says “brew me a cup” instead. LLMs understand natural language, generate plans automatically, and each step gets sent to lower-level controllers.
4. LLM Code Generation
New task needed? The old way: hire an engineer, weeks of coding. With LLM code generation (e.g. GenSim), describe the task in English and get simulation environment, reward function, training scenarios, and test cases — minutes instead of months.
5. Open Vocabulary Object Detection
Traditional detectors only recognize categories seen during training. A hospital robot sees an IV drip stand and outputs UNKNOWN. Open vocabulary models (like Grounding DINO) understand any text description: “the thing that holds the fluid bag on a pole” → detects the IV drip stand correctly, despite never having seen one during training.
6. Semantic Segmentation
Object detection draws a box. But a robot needs exact pixel boundaries to know where to grip:
Bounding box: ████████ ← robot grips the air next to the cup
████████
Segmentation: ░░░░░░
░░██████░░
░░██████── ← handle identified separately
░░██████░░
░░░░░░
Robot knows: grip the body (not the handle, too fragile), handle is an alternative grip point, bottom is a stable placement point.
7. 3D Representation and Affordance
Affordance = what can you DO with an object. Traditional robots see a cup and know it IS a cup, but don’t know which end to drink from, where the handle is grippable, or whether it will tip. Foundation models infer affordances from shape geometry, material properties, and physical common sense learned from internet data.
How It All Connects
Human: "put the fragile glass on the high shelf carefully"
↓
LLM Task Planning
"fragile" → gentle grip "high shelf" → extend arm "carefully" → slow
↓
Open Vocab Detection
Finds "glass" in scene — never seen this exact glass before
↓
Semantic Segmentation
Exact shape: rim, body, base locations identified
↓
3D Affordance
Best grip point, force needed, keep upright
↓
Value Learning
Lowest cost path = smooth, slow arc upward
↓
Diffusion Policy
Smooth grasping motion learned from human demos
↓
PID / MPC
Precise actuator control, maintains grip force
↓
Glass placed safely on shelf
Summary
| Problem | Solution |
|---|---|
| Can’t write rules for every scenario | Policy learning from demonstrations |
| Fixed reward functions get hacked | Value learning |
| Can’t pre-program every task | LLM task planning + code generation |
| Can’t recognize every object | Open vocabulary detection |
| Bounding boxes aren’t precise enough | Semantic segmentation |
| Don’t know how to interact with objects | 3D affordance models |
Part 7: RAG for Robots — How Semantic Memory Enables Navigation
Traditional RAG:             Robotics RAG:
  Documents → vectors          Camera frames → vectors
  "find relevant text"         "find relevant memory"
  answer questions             navigate and act in the world
Everything a robot sees becomes a “document” stored in vector space.
How Robot Memory Gets Built
t=0s: Camera captures kitchen
VLM captions: "kitchen counter, chips bag on left shelf,
microwave on right, blue mug near sink"
→ embed caption → store vector + timestamp + location
t=30s: Camera captures window area
VLM captions: "large window, green plants, city view"
→ embed → store
t=60s: Camera captures conference room
VLM captions: "meeting table, whiteboard, projector"
→ embed → store
... robot builds semantic map of entire environment
How RAG Retrieval Works for Navigation
Human says: "get me the chips"
↓
Speech → Text (Whisper)
↓
Embed the query: "chips" → [0.23, 0.87, 0.12, ...]
↓
Vector similarity search:
Score 0.94: "kitchen counter, chips bag on left shelf" ← MATCH
Score 0.71: "break room, snack cabinet, various foods"
Score 0.43: "office desk, papers, coffee cup"
↓
Top result: kitchen, left shelf
↓
Robot navigates and retrieves chips
The key: “chips” was never stored as an exact keyword. The vector embedding captures semantic meaning, so the query also finds “snack cabinet” and “Doritos on counter” — same concept, different words.
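The matching step underneath is cosine similarity between embedding vectors. A toy example with hand-made 3-dimensional vectors (real embeddings have hundreds of dimensions and come from a trained model):

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine of the angle between two vectors: 1 = same direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Pretend embeddings: "chips" and "snack cabinet" point roughly the
# same way in meaning-space; "office desk" points elsewhere
chips = np.array([0.9, 0.1, 0.2])
snacks = np.array([0.8, 0.2, 0.3])
desk = np.array([0.1, 0.9, 0.1])

print(cosine_sim(chips, snacks) > cosine_sim(chips, desk))  # True
```

Because similarity is measured in this learned space rather than by keyword overlap, "chips" and "snack cabinet" score high together even though they share no words.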
Full Architecture
┌─────────────────────────────────────────────┐
│ Camera (30fps) → VLM captions every 3s │
│ Microphone → Whisper → text │
│ GPS/Odometry → location coordinates │
└─────────────┬───────────────────────────────┘
│
▼
┌─────────────────────────────────────────────┐
│ Embed → Vector Database (FAISS/Milvus) │
│ [vec1, vec2, vec3 ... millions of memories] │
└─────────────┬───────────────────────────────┘
│
▼
┌─────────────────────────────────────────────┐
│ Query → embed → similarity search │
│ Top K memories → inject into LLM context │
│ LLM reasons: "Go to kitchen, chips on left" │
└─────────────┬───────────────────────────────┘
│
▼
┌─────────────────────────────────────────────┐
│ Navigation → path to kitchen │
│ Manipulation → grasp chips bag │
│ Deliver to human │
└─────────────────────────────────────────────┘
Code: Build a Robot Memory System
# pip install faiss-cpu sentence-transformers
import numpy as np
import faiss
from sentence_transformers import SentenceTransformer
from datetime import datetime
class RobotMemory:
def __init__(self):
self.model = SentenceTransformer('all-MiniLM-L6-v2')
self.dimension = 384
self.index = faiss.IndexFlatL2(self.dimension)
self.metadata = []
def store_observation(self, caption, location=(0, 0)):
"""Store a robot observation as a memory"""
vector = self.model.encode([caption])[0]
self.index.add(np.array([vector], dtype=np.float32))
self.metadata.append({
'caption': caption,
'location': location,
'timestamp': datetime.now().isoformat(),
'id': len(self.metadata)
})
print(f"Stored memory #{len(self.metadata)-1}: "
f"'{caption}' at {location}")
def query(self, question, top_k=3):
"""Find most relevant memories for a question"""
query_vector = self.model.encode([question])[0]
distances, indices = self.index.search(
np.array([query_vector], dtype=np.float32), top_k
)
results = []
for dist, idx in zip(distances[0], indices[0]):
if idx < len(self.metadata):
memory = self.metadata[idx].copy()
memory['similarity_score'] = float(1 / (1 + dist))
results.append(memory)
return results
def decide_action(self, query):
"""RAG: retrieve memories + reason about action"""
memories = self.query(query, top_k=3)
print(f"\nQuery: '{query}'")
print("Relevant memories found:")
for m in memories:
print(f" [{m['similarity_score']:.2f}] "
f"'{m['caption']}' at {m['location']}")
best = memories[0]
print(f"\nAction: Navigate to {best['location']}")
print(f"Reason: '{best['caption']}' found there")
return best['location']
# Simulate robot exploring environment
robot = RobotMemory()
robot.store_observation(
"kitchen counter with chips bag on left shelf", location=(10, 5))
robot.store_observation(
"large window with green plants and city view", location=(25, 3))
robot.store_observation(
"conference room with whiteboard and projector", location=(15, 20))
robot.store_observation(
"break room with coffee machine and mugs", location=(8, 15))
robot.store_observation(
"storage cabinet with snacks and beverages", location=(12, 6))
robot.store_observation(
"office desk with computer monitor and papers", location=(20, 10))
print("\n" + "=" * 50)
print("ROBOT MEMORY SYSTEM READY")
print("=" * 50)
robot.decide_action("get me something to eat")
robot.decide_action("I want to see something beautiful outside")
robot.decide_action("where can I make coffee")
robot.decide_action("find me a place to have a meeting")
Output:
Query: 'get me something to eat'
Relevant memories found:
[0.82] 'kitchen counter with chips bag on left shelf' at (10, 5)
[0.79] 'storage cabinet with snacks and beverages' at (12, 6)
[0.61] 'break room with coffee machine and mugs' at (8, 15)
Action: Navigate to (10, 5)
Reason: 'kitchen counter with chips bag on left shelf' found there
Why RAG Beats Traditional Navigation
| Traditional Robot Navigation | RAG-Based Robot Memory |
|---|---|
| Needs explicit map: “chips at (10,5)” | Builds map automatically from observations |
| Breaks if object moves | Updates when robot re-explores |
| Can’t understand “something to eat” | Understands any natural language query |
| New objects require reprogramming | Generalizes to new objects and queries |
| No memory across sessions | Memory persists indefinitely |
This is exactly what NVIDIA demonstrated — a robot that remembers what it has seen and navigates based on semantic memory, not hardcoded maps.