Akshay Parkhi's Weblog


Recent

March 7, 2026

How AWS Strands Hooks Work

Hooks are an event-driven extensibility system — a way to inject custom logic at specific points in the agent lifecycle without modifying core code. Think of them as middleware/interceptors.

[... 713 words]

How AWS Strands Agent Loop Works

Strands uses the ReAct pattern — a recursive loop where the LLM reasons, optionally calls tools, observes results, then reasons again. It is NOT prompt chaining (where you have a fixed sequence of prompts). The loop is open-ended and driven by the model’s decisions.

[... 650 words]

How Claude Team Agents ACTUALLY Connect — No Fluff

Agents are separate processes. They don’t share memory. They don’t call each other’s functions. They communicate through files on disk and CLI commands that Claude Code provides as tools.

[... 2,382 words]

How Pi Builds Its System Prompt at Runtime — And the Innovations That Make It Stand Out

A deep dive into the open-source coding agent that assembles its brain on the fly.

[... 2,918 words]

Scaling Agents: The Definitive Open-Source Guide — From 1 Agent to 100 Agents, 1 Tool to 100 Tools, Managing Context

Tool Wall: Every tool’s JSON schema goes into the system prompt. At 15+ tools, models start selecting wrong tools. At 50+, token usage explodes and accuracy plummets.

[... 4,423 words]

March 6, 2026

Finding the Perfect Prompt: Combining DSPy’s Optimization with Strands Agents for Cost-Effective Multi-Agent Systems

How we used Bayesian optimization to find better prompts automatically — and made cheap models perform like expensive ones.

[... 2,460 words]

March 5, 2026

How to Save 90% on Agent Token Costs with Prompt Caching on AWS Bedrock

How I reduced my AI agent’s input token costs by 90% using prompt caching on AWS Bedrock — with real pricing data and hands-on examples using Strands Agents.

[... 2,661 words]

Complete Guide: Setting Up XRoboToolkit for Robot Teleoperation with Pico 4 Ultra on WSL2

A step-by-step guide to setting up XR-based robot teleoperation using the Pico 4 Ultra headset, XRoboToolkit, and MuJoCo simulation — all running on Windows WSL2.

[... 1,293 words]

March 4, 2026

XR-Robotics with Pico 4 Ultra: VR Teleoperation Setup from Headset to Robot Simulation

I’ve been setting up XR-Robotics with a Pico 4 Ultra headset to teleoperate robot arms in simulation — and eventually collect demonstration data for imitation learning. The setup spans a PC running Ubuntu, a Python teleoperation stack, and a VR headset acting as the human interface. Here’s the complete step-by-step guide.

[... 2,636 words]

March 2, 2026

ROS 2 Humble: Complete Installation Guide with Turtlesim from Zero to First Node

This is a complete walkthrough for installing ROS 2 Humble on Ubuntu 22.04 and getting your first robot simulation running with Turtlesim. I wrote this after going through the process myself — the official docs are thorough but scattered across many pages. This puts everything in one place, from locale setup to writing your first Python node.

[... 2,166 words]

RDF, ROS, and Sim-to-Real: Understanding Robot Description Files

When you start working with robot simulation — whether it’s Isaac Sim, Gazebo, or MoveIt — you immediately run into a file called something.urdf. It’s one of those things that seems simple on the surface but connects to everything in the robotics stack. Here’s a clear breakdown of what URDF is, what it isn’t, and how it fits alongside ROS.

[... 1,082 words]

OpenTelemetry for AI Agents: How the Strands SDK Instruments Traces, Metrics, and Token Usage

I’ve been digging into the Strands Agents SDK and was surprised to find a comprehensive, production-ready OpenTelemetry integration baked right in. If you’re building AI agents and wondering how to get visibility into what’s actually happening at runtime — model calls, tool executions, latencies, token usage — this is worth understanding.

[... 1,384 words]

March 1, 2026

How a VLA Controls a Robot Arm: GR00T N1.5 System Architecture from Camera to Motor

I’ve been building a robot arm system that uses NVIDIA’s GR00T N1.5 — a Vision-Language-Action (VLA) model — to pick up objects from a table using only a camera, natural language instructions, and 50 demonstration episodes. After getting it working end-to-end, I wanted to write down the full system architecture for anyone trying to understand how all the pieces connect.

[... 912 words]

Feb. 28, 2026

Collecting Training Data for VLA Robot Fine-Tuning (The Hard Way)

A Vision-Language-Action model takes camera images and a language instruction as input, and outputs robot joint actions. NVIDIA’s GR00T N1.5 is one such model — pre-trained on millions of robot demonstrations and fine-tunable for your specific robot and task. The catch: even though GR00T is pre-trained, you still need your own demonstrations to teach it your robot’s exact joint calibration, camera angles, and task environment. Without this, the model generates actions that are plausible in general but wrong for your specific setup.

[... 1,771 words]

Feb. 27, 2026

AWS Bedrock AgentCore Async Agents

AWS Bedrock AgentCore lets you deploy AI agents as managed microVMs with built-in health checks, session management, and async task support. The async pattern is the interesting part — your agent responds immediately, runs work in the background, and the client polls for results. Here’s how the architecture works.

[... 1,172 words]

Feb. 26, 2026

Smartphone Photos to Synthetic Training Data: A 3D Reconstruction Pipeline

This pipeline turns smartphone photos of your home into synthetic training data — RGB images, depth maps, and camera parameters from viewpoints that never existed. You capture photos, reconstruct a 3D model, then render unlimited novel views. Here’s how the five-stage pipeline works.

[... 846 words]

Feb. 24, 2026

Teaching a Humanoid Robot to Wave: Custom Motions with GEAR-SONIC

GEAR-SONIC can track arbitrary motions — not just its built-in locomotion styles. You define joint angles in a CSV, preview them with direct replay in MuJoCo, then deploy through the SONIC neural network. Here’s how the three-stage pipeline works.

[... 812 words]

Feb. 22, 2026

VLA → WBC → MuJoCo: Two Ways to Wire Up NVIDIA’s GR00T Humanoid Stack

There are two ways to wire up NVIDIA’s GR00T stack from vision-language all the way down to physics simulation: the official NVIDIA eval pipeline and a custom pipeline using the SONIC C++ binary. I’ve set up both. Here’s how they work and where they differ.

[... 674 words]

From Vision to Torques: How NVIDIA’s GR00T Stack Controls a Humanoid Robot

NVIDIA’s GR00T stack for humanoid robots has three layers: a Vision-Language-Action model that understands what to do, a whole-body controller that figures out how to move, and a physics simulator that validates it all before touching real hardware. Here’s how they connect.

[... 976 words]

Feb. 21, 2026

GEAR-SONIC

GEAR-SONIC (Supersizing Motion Tracking for Natural Humanoid Whole-Body Control) is the big upgrade over the Decoupled WBC approach in the GR00T stack. It’s a completely different approach to humanoid control — unified whole-body, trained on human motion data rather than hand-crafted reward functions.

[... 325 words]

Feb. 20, 2026

NVIDIA’s GR00T Whole-Body Control stack in MuJoCo

I’ve been running NVIDIA’s GR00T Whole-Body Control stack in MuJoCo — the sim-to-real bridge for humanoid robot locomotion. A MuJoCo viewer showing a simulated robot walking might look like a toy, but the neural network policy inside it is the same binary that runs on a real Unitree G1. Here’s what’s actually going on.

[... 759 words]

Understanding LLM-Driven Python Execution: Architecture, Terminology, and Use Cases

This pattern is not just “tool use.” It is a Reasoning → Execution → Observation loop where the LLM can generate Python during runtime and run it inside a sandbox, producing deterministic outputs.

[... 646 words]

Eval methods for Tools, Skills, and Prompts, and how to ensure correctness


1. Evaluating Tools (MCP / Agentic Tools)

Tools have the most structured evaluation surface because they expose defined inputs and outputs.

Metrics to Measure

Metric                 What It Checks
Tool Correctness       Did the agent pick the right tool?
Argument Correctness   Were the arguments passed correctly?
Ordering               Were tools called in the right sequence?
Task Completion        Did the end-to-end trajectory achieve the goal?

Methods

a) Deterministic comparison (fastest, most reliable)

Compare tools_called vs expected_tools — name match, argument match, and order match.

from deepeval.test_case import LLMTestCase, ToolCall
from deepeval.metrics import ToolCorrectnessMetric

test_case = LLMTestCase(
    input="What's the return policy?",
    actual_output="We offer a 30-day refund.",
    tools_called=[ToolCall(name="WebSearch"), ToolCall(name="ToolQuery")],
    expected_tools=[ToolCall(name="WebSearch")],
)

metric = ToolCorrectnessMetric(threshold=0.7)
metric.measure(test_case)

# Score = Correctly Used Tools / Total Tools Called
# Here only 1 of the 2 called tools was expected → score 0.5, below the 0.7 threshold
print(metric.score)

b) Trajectory-level evaluation

Do not just check the final output. Evaluate the full sequence of tool calls to detect missing tools, extra tools, and parameter mismatches.

c) LLM-as-judge fallback

When tool usage is correct but non-obvious, use a judge model to assess whether the chosen tools were optimal given the available tools context.


2. Evaluating Skills

Skills require evaluation of both activation (did the correct skill load?) and output quality (did it improve results?).

a) Skill Activation / Routing Evals

Prompt                                Expected Skill    should_trigger
Review this PR for security issues    security-review   true
Fix the typo on line 3                security-review   false
Check this code for vulnerabilities   security-review   true

Grading is deterministic pass or fail — did the expected skill activate?
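
The table above can be run as a deterministic eval with a simple loop. In this sketch, `route_to_skill` is a placeholder for however your agent selects a skill for a prompt:

```python
# Deterministic skill-activation eval: for each prompt, check whether the
# expected skill activated (or correctly stayed inactive).
CASES = [
    ("Review this PR for security issues", "security-review", True),
    ("Fix the typo on line 3", "security-review", False),
    ("Check this code for vulnerabilities", "security-review", True),
]

def grade_activation(route_to_skill):
    """Return a pass/fail list: did each case match its should_trigger flag?"""
    results = []
    for prompt, skill, should_trigger in CASES:
        activated = route_to_skill(prompt) == skill
        results.append(activated == should_trigger)
    return results
```

Because the grading is exact comparison rather than judgment, these tests are cheap to run on every change to the skill's trigger description.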

b) Skill Output Quality Evals

  • LLM-as-judge with rubric scoring (1 to 5 scale)
  • Exact or string match for structured sections
  • A/B comparison (with skill vs without skill)

c) Progressive Disclosure Check

Measure token usage when multiple skills are available to ensure context does not grow unnecessarily.


3. Evaluating Prompts

a) Code-based grading (preferred)

# Exact match (normalize whitespace and case on both sides)
def eval_exact(output, expected):
    return output.strip().lower() == expected.strip().lower()

# String containment
def eval_contains(output, key_phrase):
    return key_phrase in output

b) LLM-as-judge (nuanced assessment)

def evaluate_likert(model_output, rubric):
    prompt = f"""Rate this response on a scale of 1-5:
    <rubric>{rubric}</rubric>
    <response>{model_output}</response>
    Think step-by-step, then output only the number."""
    return call_judge_model(prompt)

c) Embedding similarity

Use cosine similarity to ensure paraphrased inputs produce semantically consistent outputs.
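
A minimal sketch of that check, assuming an `embed` function (any embedding model that maps text to a vector) supplied by you:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def outputs_consistent(embed, output_a, output_b, threshold=0.85):
    # embed: text -> vector; the 0.85 threshold is illustrative — tune it
    # against a handful of known-good and known-bad pairs.
    return cosine_similarity(embed(output_a), embed(output_b)) >= threshold
```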

d) ROUGE-L for summarization

Measures overlap between generated and reference summaries.


Universal Best Practices

  1. Volume over perfection — automate at scale.
  2. Include edge cases such as typos, ambiguity, long inputs, and topic shifts.
  3. Use a different model as judge than the one being evaluated.
  4. Ask the judge to reason before scoring.
  5. Automate and version evaluations like tests.
  6. Combine deterministic checks with LLM-based scoring.

Quick Reference

  • TOOLS — deterministic tool and argument match plus trajectory validation
  • SKILLS — activation tests plus rubric-based output quality
  • PROMPTS — exact match where possible plus LLM-judge for qualitative tasks

[... 530 words]

Prompt vs Skill vs Tool

1) Prompt (Runtime System Prompt)

What it is: Instructions passed in the API call for a single request.

  • One-time instruction (per request)
  • Not enforced; the LLM may skip or reorder steps
  • Good for quick control (tone, format, role)

Use when: prototyping, low-risk tasks, or temporary behavior changes.

Avoid when: you need guaranteed step execution or strict sequencing at scale.

2) Skill (Reusable Structured Prompt Module)

What it is: A reusable, structured reasoning template/module that improves consistency across repeated tasks.

  • Reusable and standardized
  • More consistent than ad-hoc prompts
  • Still LLM-driven (probabilistic), not a hard execution engine

Use when: the task repeats often and you want consistent analysis structure, formatting, or output schema.

Avoid when: the workflow must never skip steps or must follow an exact sequence every time.

3) Tool (Deterministic Capability)

What it is: An executable function that performs a real action (API call, database query, file write, etc.).

  • Deterministic execution (given correct code and inputs)
  • Interacts with real systems or data
  • Auditable and testable

Use when: you need real data, guaranteed operations, and repeatable correctness.

Important: Orchestrator (Code) for Strict Multi-Step Workflows

If your process requires a fixed sequence of steps that must always execute in order, the most reliable design is:

  1. Use code (an orchestrator or state machine) to enforce the required steps deterministically.
  2. Then pass the final combined results to the LLM for reasoning, optionally using a Skill for consistent formatting.
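
A minimal sketch of this design, with every function name a placeholder for your own tools and model client:

```python
# Orchestrator pattern: code enforces the step order deterministically;
# the LLM only reasons over the combined results at the end.
def run_workflow(ticket, fetch_data, validate, summarize_with_llm):
    data = fetch_data(ticket)        # step 1: always runs
    issues = validate(data)          # step 2: always runs, always after step 1
    # step 3: only now does the LLM see anything, with all results combined
    return summarize_with_llm({"ticket": ticket, "data": data, "issues": issues})
```

The key property is that no prompt wording can cause a step to be skipped or reordered — the sequence lives in code, not in the model's discretion.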

Quick Decision Rule

  • Need guaranteed execution? Use Tools + Orchestrator.
  • Need consistent repeated reasoning/output? Use a Skill.
  • Need a one-off behavior tweak? Use a Prompt.

[... 273 words]

Feb. 19, 2026

GR00T Architecture: A Systems Engineering Breakdown

GR00T is not just a VLM. It is a Perception → Reasoning → Control generator stack.

[... 762 words]

GR00T N1.6 Fine-Tuning — Full Internal Deep Dive

GR00T N1.6 is NVIDIA’s Vision-Language-Action (VLA) model for humanoid robot control. After spending time digging through the internals, here’s a comprehensive deep dive into exactly how fine-tuning works — from model architecture to gradient flow to the data pipeline.

[... 1,200 words]

PPO vs VLM

Modern humanoid robots combine two fundamentally different kinds of intelligence:

[... 638 words]

Feb. 18, 2026

How GR00T Merges Vision, Chat, and Action

The biggest challenge is that vision models speak “Image-ish” (pixels) while chat models speak “Text-ish” (tokens). GR00T uses a specialized component called a Projector to act as a real-time translator.

[... 377 words]

GR00T N1.6 Architecture and Parameter Distribution

GR00T uses a massive “backbone” to understand its surroundings. It combines SigLIP 2 (for vision) and Qwen 3 (for language). While the eyes are frozen to keep perception stable, the reasoning layers are partially trainable to help the robot learn specific tasks.

[... 362 words]

Feb. 16, 2026

What I Learned Building a Streaming Agent on AWS Bedrock AgentCore Runtime

I spent a week building a conversational agent on AWS Bedrock AgentCore Runtime. It supports hybrid memory (fast in-session + persistent cross-session), real-time token streaming, multi-user isolation, and a 12-test automated suite. Here’s everything I wish someone had told me before I started.

The full source code is available on GitHub: github.com/avparkhi/AWS-BedrockAgentCore-Testing


1. The MicroVM Mental Model

AgentCore doesn’t run your agent in a container you manage. It runs inside microVMs — tiny, isolated virtual machines that spin up per-session. The first request to a new session triggers a cold start (~1.5–10s depending on what you initialize). Subsequent requests to the same session hit a warm microVM where all your Python globals are still in memory (~180ms overhead).

This means you can cache expensive things — model clients, config, database connections — as module-level globals, and they’ll persist across warm invocations. But the moment a user creates a new session, a new microVM spins up and everything resets.

Think of it like AWS Lambda, but with a session-sticky routing layer on top.
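
The caching pattern can be sketched as below. The `factory` argument is an illustration device so the sketch is testable; in a real handler you would call your own client constructor directly:

```python
# Module-level globals survive warm invocations of the same session's
# microVM and reset to their initial values on every cold start.
_model_client = None       # expensive resource, created once per microVM
_invocation_count = 0      # resets to 0 whenever a new microVM cold-starts

def get_model_client(factory):
    """Create the client on the first (cold) invocation; reuse it afterwards."""
    global _model_client
    if _model_client is None:
        _model_client = factory()
    return _model_client

def handler(payload, factory):
    global _invocation_count
    _invocation_count += 1
    client = get_model_client(factory)   # pays the init cost only once
    return {"warm": _invocation_count > 1, "client": client}
```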

2. Memory is a Two-Tier Problem

The session-sticky microVM model creates an interesting memory challenge:

  • Within a session (warm starts): You want fast, in-process memory. A simple Python dictionary works — zero latency, no API calls. This is your RAM layer.
  • Across sessions (cold starts, redeploys, crashes): You need persistent storage. AgentCore provides a Memory API for this — durable, survives everything, but adds ~300ms per read.

The hybrid approach: always save to durable memory (background, on every message). On warm starts, use RAM. On cold starts, load from durable memory. The user never notices the difference.

Observation: Making the model global (reused across invocations) and the agent per-invocation (lightweight, so lifecycle hooks fire correctly) was the right split. Model initialization is expensive; agent creation is cheap.

3. Streaming Changes the User Experience Dramatically

Without streaming, a typical warm chat response takes ~5.4 seconds. The user sees nothing until the entire response is ready. With streaming, the first token appears in ~1–1.5 seconds. The total time is the same, but the perceived latency drops by 75%.

Metric                         Non-Streaming      Streaming
Time to first visible output   ~5.4s              ~1–1.5s
Total response time            ~5.4s              ~5.4s
User perception                “Is it frozen?”    “It’s thinking and typing”

If your agent takes more than 2 seconds to respond, streaming isn’t a nice-to-have — it’s table stakes for user experience.

4. The SDK Has Opinions About Streaming (And They’re Inconvenient)

The AgentCore SDK supports streaming on the server side beautifully — if your entrypoint returns a Python generator, the SDK automatically detects it and wraps the response as Server-Sent Events (SSE). No configuration needed.

The problem is on the client side. The SDK’s built-in invoke method handles streaming by printing chunks directly to the console. It then returns an empty dictionary. There is no programmatic access to the streamed content.

If you want to actually parse streaming responses in your own client — show metadata, measure time-to-first-token, display formatted output — you need to bypass the SDK and call the API directly via boto3. It’s not hard, but it’s undocumented and unexpected.

5. The Five Gotchas

These are the things that cost me the most time and are not documented anywhere I could find.

Gotcha #1: The Payload Isn’t a Real Dict

The payload your entrypoint receives looks like a Python dictionary, but it’s actually a JSONSerializableDict. It behaves differently in subtle ways:

  • The two-argument form of .get(key, default) throws a TypeError. Only the single-argument form works.
  • Bracket access (payload["key"]) throws a TypeError — no subscript support.
  • Assignment (payload["key"] = value) also fails.

The workaround is to use single-argument .get() with an or fallback for defaults. It’s ugly, but it works reliably.

The same limitation applies to agent state objects in the Strands SDK. Don’t try to use them as dicts — store your state elsewhere.
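
The workaround can be wrapped in a small helper (a hypothetical name, not part of the SDK):

```python
# read_field: single-argument .get() plus an `or` fallback, since the payload
# type rejects the two-argument .get(key, default) and bracket access.
def read_field(payload, key, default):
    # Caveat: `or` also replaces falsy values (0, "", False) with the default,
    # which is fine for string fields but wrong for numeric flags.
    return payload.get(key) or default
```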

Gotcha #2: Return Values Get Double-Encoded

The SDK JSON-encodes whatever your entrypoint returns. If you return a JSON string, the client receives a JSON-encoded JSON string — with escaped quotes, escaped newlines, the works. You end up needing to iteratively decode the response (sometimes 3–4 rounds) to get the original data.

I ended up using a custom tag format for metadata instead of JSON, just to avoid the encoding nightmare.

Gotcha #3: Streaming Generators Have the Same Encoding Trap

This was the most frustrating bug. When your streaming generator yields values, the SDK JSON-encodes each one for the SSE wire format. If you pre-encode your dicts to JSON strings before yielding, the SDK encodes them again. The client receives a quoted string instead of a parseable dict.

The symptom: metadata objects appear as raw text inline with the chat response, and all metadata fields show as “unknown.”

The fix: yield raw Python objects (dicts, strings) and let the SDK handle serialization. Never call json.dumps() on values you’re about to yield from a streaming generator.

The rule I wish I’d known from the start: In streaming generators, the SDK is the serializer. You are the data source. Don’t do the SDK’s job — it will do it again on top of yours.
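
The trap and the fix, sketched with plain generators; `sdk_wire_format` is a stand-in for the SDK's per-chunk SSE serialization, not a real API:

```python
import json

def broken_stream(chunks):
    for c in chunks:
        yield json.dumps(c)   # WRONG: the SDK will json.dumps() this again

def correct_stream(chunks):
    for c in chunks:
        yield c               # RIGHT: yield raw dicts/strings; SDK serializes

def sdk_wire_format(gen):
    """Illustrates what the runtime does to every yielded value."""
    return [json.dumps(v) for v in gen]
```

Run both through `sdk_wire_format` and the broken path produces a quoted string the client can't parse back into a dict — exactly the “metadata shows as raw text” symptom.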

Gotcha #4: Non-Streaming Responses Silently Truncate at 1024 Bytes

There is no error. There is no warning. If your non-streaming response exceeds 1024 bytes, it gets cut off mid-JSON. You spend an hour debugging your JSON parser before realizing the data was simply incomplete.

Keep non-streaming responses compact. For my memory query mode, I had to truncate each conversation turn to ~80 characters to stay under the limit.

Gotcha #5: A Generator Function Can’t Conditionally Stream

In Python, a function that contains yield anywhere in its body always returns a generator — you can’t conditionally return a string vs. yield chunks from the same function.

The solution is a dispatch pattern: your entrypoint is a regular function that returns either a generator object (for streaming) or a string (for non-streaming). The SDK inspects the return value’s type, not the function itself. Two separate handler functions, one thin dispatcher.
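
A sketch of the dispatcher, with both handlers as trivial stand-ins for real streaming and non-streaming logic:

```python
# Dispatch pattern: the entrypoint is a plain function that returns either a
# generator object (streaming) or a string (non-streaming). The SDK inspects
# the return value's type, not the function itself.

def _stream_handler(message):
    for token in message.split():   # stand-in for real token streaming
        yield token

def _sync_handler(message):
    return message.upper()          # stand-in for a complete response

def entrypoint(payload):
    # single-argument .get() per Gotcha #1
    message = payload.get("message") or ""
    if payload.get("stream"):
        return _stream_handler(message)   # generator object → SDK streams SSE
    return _sync_handler(message)         # string → SDK returns one response
```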

6. Testing Observations

Test Through the Non-Streaming Path

All 12 of my automated tests use non-streaming mode. Streaming responses are harder to assert against (you’d need to reconstruct the full text from chunks). Non-streaming gives you a clean, complete response string to validate.

Streaming is a transport concern. If your logic works non-streaming, it works streaming — the same code runs either way, just yielded differently.

Memory Persistence Is the Key Test

The most important test: send a math problem on session A, then ask about the result on a completely different session B. Session B has empty RAM (new microVM). If the agent remembers the answer, durable memory is working. If the response attributes the memory source as “durable,” the hybrid architecture is working correctly.

Isolation Tests Are Easy to Get Wrong

Testing that User 2 doesn’t see User 1’s history is tricky because LLMs can coincidentally mention the same numbers. Checking that the response doesn’t contain both “50” and “48” (from User 1’s specific calculation chain) is more robust than checking for either alone. Even so, these tests can be flaky.

Observed Latencies

Operation                               Cold Start   Warm
Ping (no LLM, pure microVM overhead)    ~1.7s        ~180ms
Chat (LLM + memory hooks)               ~8–17s       ~5.4s
Chat streaming (time to first token)                 ~1–1.5s
Memory query (durable read, no LLM)                  ~350ms

The wide range on cold-start chat (8–17s) is due to both microVM provisioning and the first-time model/memory client initialization. Subsequent cold starts are faster (~8s) because ECR image layers are cached.

7. Observability Setup

AgentCore supports three observability pillars, all routed through the runtime role’s IAM permissions:

Pillar    Where It Goes        What You See
Logs      CloudWatch Logs      Two streams: runtime-logs (your print output) and otel-rt-logs (OpenTelemetry)
Traces    X-Ray via OTel       End-to-end request traces with timing breakdown
Metrics   CloudWatch Metrics   Invocation count, latency, errors under the bedrock-agentcore namespace

The runtime role needs permissions for all three: log group/stream creation and writes, X-Ray segment and telemetry submission, and CloudWatch metric publishing scoped to the bedrock-agentcore namespace.

Easy to miss: The runtime config must set the server protocol to HTTP for runtime-logs to be captured. Without this, your agent’s print statements go nowhere.

The GenAI Observability Dashboard in the CloudWatch console provides a unified view across all your AgentCore agents — worth bookmarking.

8. Deployment Notes

  • CodeBuild is the default and easiest path. It builds ARM64 containers in the cloud — no local Docker required. Build time is ~40 seconds.
  • Use your venv Python. On macOS, the system Python is typically 3.9.x. The SDK needs 3.11+. I wasted time on cryptic import errors before realizing I was using the wrong interpreter.
  • Memory resources are idempotent-ish. Creating a memory that already exists throws a ValidationException. You need to catch it and look up the existing one via list. Not a big deal, but not obvious from the API.
  • The agent ARN lives in the YAML config (.bedrock_agentcore.yaml), not in your deploy info JSON. You need it for direct boto3 calls. The SDK handles this internally when you use its invoke method.
  • Redeployment is seamless. Update your code, run deploy, the agent updates in place. Session IDs reset, but durable memory persists. Warm microVMs from the old version drain naturally.

9. What I’d Do Differently Next Time

  1. Build non-streaming first, add streaming last. Streaming is purely a transport layer concern. Get your agent logic, memory, and tools working with simple request/response, then layer streaming on top. Debugging streaming issues while your core logic is also broken is miserable.
  2. Assume everything the SDK touches will be JSON-encoded. Return values, streaming yields, headers — if the SDK handles it, expect encoding. Design your data formats around this from day one.
  3. Keep non-streaming responses tiny. The 1024-byte truncation limit means you should design response formats to be compact from the start, not try to shrink them after the fact.
  4. Write the ping test first. A no-LLM ping mode that returns cache status and invocation count is invaluable for debugging microVM behavior. It isolates platform issues from agent logic issues. Every new agent I build will start with this.
  5. Use tags, not JSON, for non-streaming metadata. A simple delimited format like <<META:key=val,key=val>> survives the SDK’s encoding gauntlet intact. JSON metadata in non-streaming responses is a losing battle.
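
The tag format from point 5 can be round-tripped with a few lines; both helper names and the exact `<<META:...>>` grammar here are illustrative, not a standard:

```python
import re

# A delimited metadata tag appended to the response text survives the SDK's
# JSON encoding as plain characters, unlike nested JSON.
META_RE = re.compile(r"<<META:(.*?)>>")

def attach_meta(text, **fields):
    body = ",".join(f"{k}={v}" for k, v in fields.items())
    return f"{text}<<META:{body}>>"

def extract_meta(raw):
    """Split a response into (text, metadata dict)."""
    m = META_RE.search(raw)
    if not m:
        return raw, {}
    meta = dict(pair.split("=", 1) for pair in m.group(1).split(","))
    return raw[:m.start()] + raw[m.end():], meta
```

The obvious limitation: values must not contain `,`, `=`, or `>>`. For simple fields like a memory source or token count, that trade is worth avoiding the decode loop.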

End result: A production-ready agent with 12/12 tests passing, hybrid memory that works seamlessly across sessions and restarts, streaming that drops perceived latency from 5.4s to 1.5s, and a clear understanding of every sharp edge in the platform.

[... 1,836 words]
