Akshay Parkhi's Weblog

The 95% Rule: Why Your Agent Is Slow and How to Prove It

12th March 2026

Your agent takes 5 seconds to respond. Where did those 5 seconds go? AgentCore gives you 6 observability layers, 30 hidden metrics, and a debugging decision tree — but you have to know where to look. Here’s everything you can’t see by just reading the code.

The 6 Layers of Observability

AgentCore gives you 6 distinct observability layers, each revealing something the others can't:

Layer 1: CLIENT-SIDE TIMING
  You measure this yourself (time.time() around invoke_agent_runtime)
  Shows: Total end-to-end latency including network
  Blind spot: Can't see what's happening inside

Layer 2: RUNTIME LOGS (CloudWatch Logs → [runtime-logs] streams)
  Your print() statements + bedrock_agentcore framework logs
  Shows: Request arrival, tool calls, completion time, errors
  Blind spot: No per-component breakdown

Layer 3: OTEL TRACE EVENTS (CloudWatch Logs → otel-rt-logs stream)
  Every message in the LLM conversation
  Shows: System prompt, user input, LLM response, tool calls, tool results
  Blind spot: No timing (just message content)

Layer 4: OTEL EMF METRICS (CloudWatch Logs → otel-rt stream)
  Embedded Metric Format — auto-extracted into CloudWatch Metrics
  Shows: Per-request LLM duration, tool duration, token counts, TTFT
  Blind spot: Aggregated per-request (no per-message timing)

Layer 5: AWS/Bedrock-AgentCore METRICS (CloudWatch Metrics namespace)
  AWS-measured metrics from OUTSIDE the microVM
  Shows: End-to-end latency with percentiles, errors, throttles, billing
  Blind spot: No inside-the-VM breakdown

Layer 6: CLOUDWATCH LOGS INSIGHTS (query engine)
  SQL-like queries across all log streams
  Shows: Aggregations, patterns, statistics across all invocations
  Blind spot: Query syntax is limited, 5-second minimum delay
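
Layer 1 is the only layer you build yourself. A minimal sketch of the timing pattern (the wrapped client call and its parameter names are illustrative placeholders; the demo uses a stand-in function):

```python
import time

def timed_call(fn, *args, **kwargs):
    """Measure wall-clock latency of any call, e.g. invoke_agent_runtime."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# In production you would wrap the real client call, roughly:
#   response, latency = timed_call(client.invoke_agent_runtime,
#                                  agentRuntimeArn=AGENT_ARN,
#                                  runtimeSessionId=session_id,
#                                  payload=body)

# Demo with a stand-in that "responds" after 50ms:
result, latency = timed_call(lambda: time.sleep(0.05) or "ok")
```

One caveat: with a streamed response, time until the stream is fully read, not until the call returns, or you undercount (that gap is the 4.800s vs 5.544s difference in the timeline later in this post).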

The Sidecar Tax — The Time You Can Never See

Every request passes through the sidecar (port 9000) before reaching your code (port 8080). The sidecar adds 50-200ms for TLS termination, auth token validation, session ID → microVM routing lookup, request serialization, and HTTP forwarding to :8080.

Sidecar tax = Client total time - http.server.duration (EMF metric)

For our test: 5.544s (client) - 4.615s (http.server.duration) = 0.929s sidecar + network

On cold starts, this includes Firecracker microVM boot (125ms) + Python startup + your imports.
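
The subtraction is trivial, but worth automating when you run batches (numbers from the test above):

```python
def sidecar_tax(client_total_s: float, http_server_duration_s: float) -> float:
    """Everything your code never sees: sidecar, routing, network, cold boot."""
    return client_total_s - http_server_duration_s

tax = sidecar_tax(5.544, 4.615)   # ≈ 0.929s
```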

Two Log Streams, Completely Different Data

Log Group: /aws/bedrock-agentcore/runtimes/{AGENT_ID}-DEFAULT
│
├── 2026/03/12/[runtime-logs]ed8b8c65-...    ← MicroVM instance #1
├── 2026/03/12/[runtime-logs]375e9614-...    ← MicroVM instance #2
├── 2026/03/12/[runtime-logs]212edc45-...    ← MicroVM instance #3
│   ... (one stream per microVM that ever existed)
│
└── otel-rt-logs                              ← ALL OTel data (shared stream)

The UUID in [runtime-logs]<uuid> IS the Firecracker microVM instance ID. If you see the same UUID handling multiple requests, those requests hit the same warm microVM (sticky session working). If you see different UUIDs, those were different microVMs (cold starts or load balancing).
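
A small helper makes that forensics scriptable: pull the microVM ID out of each stream name, then group requests by it. A sketch, assuming stream names follow the layout shown above:

```python
import re

def microvm_id(stream_name: str):
    """Extract the Firecracker microVM UUID from a [runtime-logs] stream name."""
    m = re.search(r"\[runtime-logs\]([0-9a-f-]+)", stream_name)
    return m.group(1) if m else None

vm = microvm_id("2026/03/12/[runtime-logs]ed8b8c65-d8ce-4287-a67d-8d464523db53")
# the shared otel-rt-logs stream yields None
```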

Embedded Metric Format (EMF) — Metrics Without put_metric_data

OTel logs contain _aws.CloudWatchMetrics JSON blocks. CloudWatch automatically extracts these into metrics without you calling put_metric_data():

{
  "_aws": {
    "Timestamp": 1773335423274,
    "CloudWatchMetrics": [{
      "Namespace": "bedrock-agentcore",
      "Metrics": [{"Name": "strands.tool.duration", "Unit": "Seconds"}],
      "Dimensions": [["tool_name", "tool_use_id"]]
    }]
  },
  "strands.tool.duration": {"Values": [0.003], "Counts": [1]},
  "tool_name": "calculator",
  "tool_use_id": "tooluse_vEjG3idNjMdOhbBd3peHaL"
}

The OTel collector on port 8000 inside the microVM receives traces from opentelemetry-instrument, converts them to EMF, and writes them to CloudWatch Logs. CloudWatch then auto-extracts the metrics.
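
You can replay the extraction yourself on exported records. A sketch using the sample above (the Values/Counts histogram shape shown there is one of the forms EMF allows; bare numbers are the other):

```python
import json

def extract_emf_metrics(record: str) -> dict:
    """Map metric name -> value for every metric an EMF record declares."""
    doc = json.loads(record)
    out = {}
    for directive in doc.get("_aws", {}).get("CloudWatchMetrics", []):
        for metric in directive["Metrics"]:
            value = doc[metric["Name"]]
            if isinstance(value, dict):   # {"Values": [...], "Counts": [...]} form
                value = value["Values"][0]
            out[metric["Name"]] = value
    return out

sample = json.dumps({
    "_aws": {"Timestamp": 1773335423274,
             "CloudWatchMetrics": [{"Namespace": "bedrock-agentcore",
                                    "Metrics": [{"Name": "strands.tool.duration",
                                                 "Unit": "Seconds"}],
                                    "Dimensions": [["tool_name", "tool_use_id"]]}]},
    "strands.tool.duration": {"Values": [0.003], "Counts": [1]},
    "tool_name": "calculator",
})
metrics = extract_emf_metrics(sample)
```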

Trace ID = Your Request’s DNA

Every OTel event has a traceId field. All events from the same invoke_agent_runtime() call share the same traceId. The spanId changes per operation:

traceId: 69b2f37963a139ff1d6114ea6b800056  (one per request)
├── spanId: f9aff898  → gen_ai.system.message (LLM call #1 start)
├── spanId: f9aff898  → gen_ai.user.message
├── spanId: f9aff898  → gen_ai.choice (tool_use)
├── spanId: 91da9ac0  → strands.telemetry.tracer (cycle #1 end)
├── spanId: 8cba081d  → strands.telemetry.tracer (tool result)
├── spanId: 57695a94  → gen_ai.system.message (LLM call #2 start)
├── spanId: 57695a94  → gen_ai.choice (end_turn)
├── spanId: f2f863700 → strands.telemetry.tracer (cycle #2 end)
├── spanId: ee60336f  → strands.telemetry.tracer (agent complete)
└── spanId: 9f9f5122  → bedrock_agentcore.app "Invocation completed (4.613s)"

To debug a specific slow request: grep for the session_id in OTel logs, get the traceId, then filter ALL OTel events by that traceId.
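
The same two-step lookup in code, over exported log events. The attribute key holding the session is an assumption based on the otel-rt-logs samples; adjust it to what your records actually contain:

```python
import json

def events_for_session(raw_events, session_id):
    """Find the traceId carrying a session, then pull every event on that trace."""
    events = [json.loads(e) for e in raw_events]
    trace_ids = {e["traceId"] for e in events
                 if e.get("attributes", {}).get("session.id") == session_id}
    return [e for e in events if e["traceId"] in trace_ids]

# Toy events: two spans on trace "a1", one on "b2"; only one span names the session.
raw = [
    json.dumps({"traceId": "a1", "spanId": "f9aff898",
                "attributes": {"session.id": "sess-123"}}),
    json.dumps({"traceId": "a1", "spanId": "91da9ac0", "attributes": {}}),
    json.dumps({"traceId": "b2", "spanId": "57695a94",
                "attributes": {"session.id": "sess-456"}}),
]
mine = events_for_session(raw, "sess-123")   # both "a1" events
```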

The Event Loop Is The Agent’s Heartbeat

User prompt arrives
    │
    ▼
┌─ CYCLE 1 ─────────────────────────────────┐
│  1. Build messages (system prompt + input)  │
│  2. Call LLM (Bedrock)                      │  ← Most time here
│  3. LLM returns: tool_use or end_turn       │
│  4. If tool_use: execute tool               │  ← Second most time
│  5. Append tool result to messages          │
└─────────────────────────────────────────────┘
    │ (if tool_use, loop back)
    ▼
┌─ CYCLE 2 ─────────────────────────────────┐
│  1. Build messages (now includes cycle 1)  │
│  2. Call LLM again                         │  ← Context now LARGER
│  3. LLM returns: end_turn                  │
└────────────────────────────────────────────┘
    │
    ▼
Return response to user

Each cycle is measured by strands.event_loop.cycle_count and strands.event_loop.latency.
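
Stripped of streaming and error handling, the loop above is roughly this. The message and reply shapes are simplified stand-ins, not the real Strands API:

```python
def run_event_loop(call_llm, tools, messages, max_cycles=10):
    """Minimal agent event loop: call model, run tool, append result, repeat."""
    for cycle in range(1, max_cycles + 1):
        reply = call_llm(messages)                      # most of the latency is here
        messages.append({"role": "assistant", "content": reply})
        if reply["stop_reason"] == "end_turn":
            return reply["text"], cycle
        result = tools[reply["tool"]](reply["input"])   # second most
        messages.append({"role": "tool", "content": result})
    raise RuntimeError("max_cycles exceeded without end_turn")

# Fake model: tool_use on the first call, end_turn on the second.
replies = [{"stop_reason": "tool_use", "tool": "calculator", "input": "2+2"},
           {"stop_reason": "end_turn", "text": "The answer is 4."}]
answer, cycles = run_event_loop(lambda msgs: replies.pop(0),
                                {"calculator": lambda expr: "Result: 4"},
                                [{"role": "user", "content": "What is 2+2?"}])
```

Note how the messages list only ever grows: that is the token-growth problem covered next.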

Token Growth — The Silent Performance Killer

Simple request: "What is 2+2?"  (2 cycles)
  Cycle 1: 1752 input tokens → 44 output (tool_use)   = 2.916s
  Cycle 2: 1822 input tokens → 54 output (final text)  = 1.489s
  Token growth: +70 tokens (+4%)

Complex request: "15*37, add 42, tell me the time"  (2 cycles)
  Cycle 1: 1771 input tokens → 100 output (tool_use)   = 3.072s
  Cycle 2: 1952 input tokens → 117 output (final text)  = 3.451s
  Token growth: +181 tokens (+10%)

Why it matters: if your agent runs 10 cycles, input tokens grow with every cycle. Cycle 1 might process ~1,750 tokens in ~1.5s, but cycle 10 processes ~5,000 tokens in ~4.0s. Ten cycles with growing latency = ~30 seconds just for LLM calls. This is the #1 cause of “my agent takes 2 minutes.”
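
The arithmetic behind that estimate, under the rough assumption that per-cycle latency grows linearly with the context:

```python
def projected_llm_time(cycles: int, first_cycle_s: float, last_cycle_s: float) -> float:
    """Total LLM time across N cycles if per-cycle latency grows linearly
    from first_cycle_s to last_cycle_s. A rough model, not a guarantee."""
    return cycles * (first_cycle_s + last_cycle_s) / 2   # arithmetic series sum

total = projected_llm_time(10, 1.5, 4.0)   # 27.5s, i.e. "~30 seconds just for LLM calls"
```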

Time-to-First-Token (TTFT) vs Total Duration

gen_ai.client.operation.duration = TTFT + token streaming
                                    ↑         ↑
                              LLM thinking   generating output

For tool_use responses (short):
  TTFT: 2707ms → Total: 2916ms → Streaming: 209ms (7%)

For text responses (longer):
  TTFT: 2662ms → Total: 3455ms → Streaming: 793ms (23%)

High TTFT means the model is thinking longer. Causes: a large input context, complex reasoning, an overloaded model (try a different region), or simply a bigger model (Opus is slower than Sonnet, which is slower than Haiku).

The 95% Rule

Real measurement from a “What is 2+2?” request:

Total inside microVM:  4,613ms (100%)
├── LLM call #1:       2,916ms (63%)
├── LLM call #2:       1,489ms (32%)
├── Tool (calculator):      3ms (0.07%)
└── Overhead:             205ms (4.4%)

LLM TOTAL:             4,405ms (95.5%)

95.5% of the in-VM time went to LLM inference. This is typical for agents with fast tools. The implications:

  • Optimizing your tool code = marginal gains
  • Switching from Sonnet to Haiku = 2-5x improvement
  • Reducing input tokens by 50% = ~30% improvement
  • Reducing cycles from 5 to 2 = ~60% improvement
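
To apply the rule to your own traces, turn the EMF durations into shares of in-VM time (numbers from the measurement above):

```python
def time_shares(components_ms: dict, total_ms: float) -> dict:
    """Express each measured component as a fraction of total in-VM time;
    whatever is unaccounted for becomes 'overhead'."""
    shares = {name: ms / total_ms for name, ms in components_ms.items()}
    shares["overhead"] = 1.0 - sum(shares.values())
    return shares

s = time_shares({"llm_call_1": 2916, "llm_call_2": 1489, "tool": 3}, 4613)
llm_share = s["llm_call_1"] + s["llm_call_2"]   # ≈ 0.955, the 95% rule
```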

Tool Duration Reveals External Dependencies

calculator:  0.003s  (pure computation — instant)
weather:     0.100s  (HTTP call to weather API)
database:    1.200s  (connection + query + serialization)

If a tool takes more than a second, it's almost certainly waiting on an external service. Fixes: connection pooling, caching, timeouts, parallel execution.

Two Namespaces, Two Perspectives

AWS/Bedrock-AgentCore (AWS-side — outside the microVM)
├── Latency              ← End-to-end including sidecar (what your user feels)
├── Invocations          ← Count of invoke_agent_runtime() calls
├── Sessions             ← Count of NEW sessions created
├── Errors               ← SystemErrors + UserErrors
├── Throttles            ← Rate limit exceeded
├── CPUUsed-vCPUHours    ← BILLING: CPU usage
└── MemoryUsed-GBHours   ← BILLING: Memory usage

bedrock-agentcore (OTel — inside the microVM)
├── http.server.duration          ← Time inside your code
├── gen_ai.client.token.usage     ← Token counts
├── strands.event_loop.*          ← Event loop metrics
├── strands.tool.*                ← Tool metrics
└── strands.model.time_to_first_token  ← LLM thinking time

Percentiles Tell the Real Story

From a 58-invocation test batch:

Latency (AWS/Bedrock-AgentCore namespace):
  avg: 3,082ms
  min: 1,305ms
  max: 14,849ms  ← 5x slower than average!
  p50: 2,436ms   ← Typical request
  p90: 3,770ms   ← 90% of requests finish by here
  p99: 14,220ms  ← Worst 1% — likely cold starts

p50 (2.4s) vs p99 (14.2s) = 6x difference. The p99 outlier is almost certainly a cold start. If you only look at averages, you miss this entirely.
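
If you export raw latencies (client-side timings, or Logs Insights results), a nearest-rank percentile reproduces what CloudWatch shows and makes the average-vs-tail gap obvious. A sketch with a synthetic batch shaped like the one above:

```python
import math

def pct(samples, p):
    """Nearest-rank percentile: the smallest sample covering p% of the distribution."""
    ordered = sorted(samples)
    return ordered[math.ceil(p / 100 * len(ordered)) - 1]

# 57 typical requests plus one cold-start outlier:
latencies = [2.4] * 57 + [14.8]
mean = sum(latencies) / len(latencies)              # ≈ 2.6, the outlier barely moves it
p50, p99 = pct(latencies, 50), pct(latencies, 99)   # 2.4 vs 14.8, the real story
```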

Sessions vs Invocations

58 invocations but only 31 new sessions. That means 27 requests (47%) hit existing warm sessions, proving sticky routing works. The lower your Sessions/Invocations ratio, the more you're benefiting from warm microVMs.
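
The ratio as a one-liner (numbers from the batch above):

```python
def warm_hit_ratio(invocations: int, new_sessions: int) -> float:
    """Fraction of invocations that reused an existing warm microVM session."""
    return (invocations - new_sessions) / invocations

ratio = warm_hit_ratio(58, 31)   # ≈ 0.47: 47% of requests skipped a cold path
```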

Throttles, Errors, and Hidden Retries

SystemErrors → AWS infrastructure issue. Nothing you can do. Wait and retry.
UserErrors   → Your @app.entrypoint threw an exception. Check runtime logs.
Throttles    → You hit a rate limit. Request increase via Service Quotas.

Hidden: the boto3 client has built-in retry with exponential backoff. A single throttle from AWS may result in 2-3 actual API calls before succeeding. Your client-side timing includes retry time, but the AWS Latency metric only counts the final successful attempt.

CPU and Memory Billing — Real Numbers

CPUUsed-vCPUHours:    0.004628  → $0.000414 (@ $0.0895/vCPU-hr)
MemoryUsed-GBHours:   0.007257  → $0.000069 (@ $0.00945/GB-hr)

Key points:
  CPU is charged only when your code is executing (not idle)
  Memory is charged for 128MB minimum
  Idle sessions = $0 (confirmed in 10-minute idle test)
  Billing is metered per second, then aggregated into hourly metrics

One Log Stream = One MicroVM’s Entire Life

2026/03/12/[runtime-logs]ed8b8c65-d8ce-4287-a67d-8d464523db53

This stream contains ALL logs from microVM ed8b8c65 from boot to termination. If this VM handled 10 requests, all 10 appear in this stream. When the VM is terminated (idle timeout or explicit stop), no more logs appear.

Forensic trick — count how many requests a specific microVM handled:

aws logs filter-log-events \
  --log-group-name "/aws/bedrock-agentcore/runtimes/AGENT_ID-DEFAULT" \
  --log-stream-names "2026/03/12/[runtime-logs]ed8b8c65-..." \
  --filter-pattern "Invocation completed"

CloudWatch Logs Insights — The Power Queries

Duration percentiles across all invocations:

fields @timestamp, @message
| filter @message like /Invocation completed/
| parse @message '"message": "Invocation completed successfully (*s)"' as duration
| stats count() as n,
        avg(duration) as avg_s,
        pct(duration, 50) as p50,
        pct(duration, 90) as p90,
        pct(duration, 99) as p99,
        max(duration) as max_s

Tool usage frequency:

fields @message
| filter @message like /^Tool #/
| parse @message 'Tool #*: *' as num, tool_name
| stats count() as calls by tool_name
| sort calls desc

Cold starts per hour:

fields @timestamp
| filter @message like /Connection failed out to container health check/
| stats count() as cold_starts by bin(1h)
| sort cold_starts desc

Slowest sessions:

fields @timestamp, @message
| filter @message like /Invocation completed/
| parse @message '"message": "Invocation completed successfully (*s)", "logger": "*", "requestId": "*", "sessionId": "*"' as duration, logger, req_id, session_id
| sort duration desc
| limit 10

The “Connection failed out to container health check” Message

This appears exactly once per microVM cold boot. It’s the sidecar’s first TCP probe hitting the microVM before Uvicorn is fully listening. The sidecar retries with a proper GET /ping and succeeds.

Counting these messages = counting cold starts. If you see 50 of these in an hour, you had 50 cold microVM boots.
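
If you pull the matching events programmatically (e.g. via filter-log-events), binning their epoch-millisecond timestamps gives the same cold-starts-per-hour view as the Insights query:

```python
from collections import Counter
from datetime import datetime, timezone

def cold_starts_per_hour(timestamps_ms):
    """Bin cold-boot event timestamps (epoch ms, as CloudWatch returns them) by UTC hour."""
    hours = Counter()
    for ts in timestamps_ms:
        stamp = datetime.fromtimestamp(ts / 1000, tz=timezone.utc)
        hours[stamp.strftime("%Y-%m-%d %H:00")] += 1
    return hours

# Three probes: two in the same hour, one an hour later.
bins = cold_starts_per_hour([0, 120_000, 3_600_000])
```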

The otel-rt-logs Shared Stream Problem

All microVM instances write to the same otel-rt-logs stream. Events from different requests are interleaved. You MUST filter by traceId or session_id to isolate one request.

fields @timestamp, @message
| filter @logStream = "otel-rt-logs"
| filter @message like /"session.id":"YOUR_SESSION_ID"/
| sort @timestamp asc

Three Processes, Three Ports

PID 1: Sidecar (AWS-injected)            → :9000  (receives from NLB)
PID 2: OTel Collector (ADOT)             → :8000  (receives traces from your app)
PID 3: opentelemetry-instrument python   → :8080  (YOUR app via Uvicorn)
        └── Uvicorn → Starlette (BedrockAgentCoreApp)
            ├── POST /invocations  → your @app.entrypoint
            ├── GET  /ping         → health check
            └── WS   /ws           → websocket (unused in REST mode)

The opentelemetry-instrument wrapper automatically instruments all boto3 calls, captures LLM request/response messages, measures tool execution time, counts event loop cycles, and sends everything to the OTel collector on :8000.

The http.server_name Reveals AWS Internals

http.server_name: cell01.us-east-1.prod.arp.kepler-analytics.aws.dev

cell01             — The specific compute cell running your microVM
us-east-1          — AWS region
prod               — Production environment
arp                — Agent Runtime Platform (internal codename)
kepler-analytics   — Project Kepler (AgentCore's internal name)
.aws.dev           — AWS internal domain

This tells you which physical compute cell your microVM landed on. If one cell is consistently slower, it could indicate noisy-neighbor issues.

The Invalid HTTP Request Warning

WARNING: Invalid HTTP request received.

This appears on every cold start. The sidecar opens a raw TCP connection to check whether the port is open before sending a proper HTTP GET /ping; Uvicorn accepts the connection but can't parse what arrives as HTTP. It's harmless: the sidecar immediately retries with a valid HTTP request. But if you see many of these in sequence (10+), the microVM is taking unusually long to boot.

The OTel Collector Can Crash Silently

The ADOT collector on :8000 is a separate process. If it crashes or runs out of memory:

  • Your agent still works (requests succeed)
  • You lose all metrics (no EMF, no traces)
  • CloudWatch shows gaps in the bedrock-agentcore namespace
  • The AWS/Bedrock-AgentCore namespace still works (measured outside)

How to detect: if you see invocations in the AWS/Bedrock-AgentCore namespace but NO corresponding events in otel-rt-logs, the OTel collector died.

Cost Per Request — Real Numbers

Simple "What is 2+2?" request:

HAIKU MODEL:
  LLM input:   3,574 tokens × $0.25/MTok  = $0.000894
  LLM output:      63 tokens × $1.25/MTok  = $0.000079
  Compute CPU: 4.6s × $0.0895/vCPU-hr      = $0.000114
  Compute Mem: 4.6s × $0.00945/GB-hr × 0.128GB = $0.000002
  ─────────────────────────────────────────────
  TOTAL:                                      $0.001089 (~$1.09/1000 requests)

SONNET MODEL:
  LLM input:   3,574 tokens × $3/MTok      = $0.010722
  LLM output:      63 tokens × $15/MTok     = $0.000945
  Compute (same):                             $0.000116
  ─────────────────────────────────────────────
  TOTAL:                                      $0.011783 (~$11.78/1000 requests)

Idle sessions truly cost $0. The microVM stays in memory (reserved by Firecracker) but CPU is suspended. AWS only bills when your code is actively executing.
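
The same arithmetic as a reusable function, with prices as quoted above. It assumes 1 vCPU and the 128MB memory floor; adjust both for your configuration:

```python
def request_cost(in_tokens, out_tokens, active_seconds,
                 in_price_per_mtok, out_price_per_mtok,
                 vcpu_hr=0.0895, gb_hr=0.00945, mem_gb=0.128):
    """Per-request cost in USD: model tokens plus AgentCore active compute."""
    llm = (in_tokens * in_price_per_mtok + out_tokens * out_price_per_mtok) / 1e6
    compute = active_seconds / 3600 * (vcpu_hr + gb_hr * mem_gb)
    return llm + compute

haiku = request_cost(3574, 63, 4.6, 0.25, 1.25)   # ≈ $0.00109
sonnet = request_cost(3574, 63, 4.6, 3.0, 15.0)   # ≈ $0.01178
```

Note how the compute term is noise next to the token term: model choice, not infrastructure, dominates the bill.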

Cold vs Warm vs Sticky — Real Production Numbers

From a 94-invocation benchmark:

┌──────────────────────────────────────────────────┐
│  Type     Avg     Min     Max      p50     p99    │
│  ──────   ─────   ─────   ──────   ─────   ────── │
│  COLD     3.406s  2.165s  14.849s  2.436s  14.220s│
│  WARM     2.797s  1.695s  3.769s   2.563s  3.379s │
│  STICKY   2.532s  1.305s  3.378s   2.435s  3.370s │
└──────────────────────────────────────────────────┘

p99: Cold 14.2s vs Warm 3.4s = 4x improvement!

The average improvement (18-26%) understates the real benefit. The p99 improvement (4x) matters more because those are the cold-start outliers that users actually feel.

Concurrent Scaling Behavior

5 concurrent cold starts:
  All 5 complete in 3.166s wall clock
  Individual: 2.1s - 3.2s range

5 concurrent warm hits:
  All 5 complete in 2.746s wall clock
  Individual: 1.7s - 2.7s range

AgentCore provisions microVMs in parallel. 5 simultaneous cold starts don’t serialize — they all boot at once. The wall clock time ≈ slowest individual request, not sum of all requests.
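
You can see the same shape locally with a stand-in for the runtime call: fan out five "invocations" and compare wall clock against the sum of the individual times:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_invoke(delay_s):
    """Stand-in for invoke_agent_runtime(): blocks like a booting microVM would."""
    time.sleep(delay_s)
    return delay_s

delays = [0.05, 0.08, 0.06, 0.07, 0.09]
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=len(delays)) as pool:
    results = list(pool.map(fake_invoke, delays))
wall = time.perf_counter() - start
# wall ≈ max(delays), not sum(delays): the five calls overlap,
# just as AgentCore's parallel provisioning overlaps real cold starts
```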

The Complete Request Flow (Annotated Timeline)

T+0.000s  Your code: invoke_agent_runtime()
T+0.050s  boto3 serializes request, signs with SigV4
T+0.100s  HTTPS to bedrock-agentcore.us-east-1.amazonaws.com
T+0.150s  AWS API Gateway receives request
T+0.200s  AgentCore control plane: session_id → microVM lookup
           IF new session: provision new Firecracker microVM (125ms)
           IF existing: route to existing microVM
T+0.300s  NLB forwards to sidecar on :9000
T+0.350s  Sidecar validates auth, injects headers
T+0.400s  Sidecar forwards to :8080/invocations
T+0.450s  BedrockAgentCoreApp._handle_invocation() called
T+0.460s  Your @app.entrypoint function starts
T+0.470s  Strands Agent event loop begins
T+0.480s  ── CYCLE 1 ──
T+0.490s    Build messages: system prompt + user input (1752 tokens)
T+0.500s    Call bedrock-runtime/model/invoke
T+0.510s      TTFT: model thinking... (2707ms)
T+3.207s      Model returns: tool_use(calculator, "2+2")
T+3.210s    Execute tool: calculator("2+2") → "Result: 4" (3ms)
T+3.213s  ── CYCLE 2 ──
T+3.220s    Build messages: system + user + assistant + tool_result (1822 tokens)
T+3.230s    Call bedrock-runtime again
T+3.240s      TTFT: model thinking... (1395ms)
T+4.635s      Model returns: end_turn "The answer is **4**."
T+4.640s  Event loop ends
T+4.650s  BedrockAgentCoreApp sends response
T+4.660s  Sidecar forwards response upstream
T+4.700s  OTel collector batches metrics, writes EMF to CloudWatch Logs
T+4.800s  Response reaches your boto3 client
T+5.544s  Your code finishes reading response stream

The Debugging Decision Tree

Agent is slow
├── WHERE is time spent?
│   ├── LLM inference (> 80%) ──── THE MOST COMMON CASE
│   │   ├── Too many cycles? (> 3)
│   │   │   ├── Simplify system prompt
│   │   │   ├── Remove unnecessary tools
│   │   │   └── Add "answer directly when possible" instruction
│   │   ├── Too many input tokens?
│   │   │   └── Use shorter tool responses
│   │   ├── Model too slow?
│   │   │   ├── Switch Opus → Sonnet → Haiku
│   │   │   ├── Try different AWS region
│   │   │   └── Use streaming for perceived speed
│   │   └── High TTFT? (> 3s)
│   │       ├── Model overloaded (try off-peak hours)
│   │       └── Too many tools registered (each adds ~100 tokens)
│   │
│   ├── Tool execution (> 20%)
│   │   ├── Which tool? (check strands.tool.duration)
│   │   ├── External API slow → connection pooling, caching
│   │   ├── Database slow → connection reuse, indexing
│   │   └── No timeout → add timeout (default 30s)
│   │
│   └── Cold start (first request only)
│       ├── Large Docker image → minimize image
│       ├── Heavy imports → lazy loading
│       ├── Model initialization → cache model objects
│       └── Pre-warm with warm pools
│
├── PATTERN?
│   ├── First request slow, rest fast → cold start
│   ├── Getting slower over time → token growth per cycle
│   └── All requests slow → check model, check region
│
└── HOW to investigate?
    ├── Quick: aws logs tail --follow (real-time)
    ├── Deep: OTel EMF metrics (per-component breakdown)
    ├── Historical: Logs Insights queries (aggregations)
    ├── Visual: CloudWatch GenAI dashboard (UI)
    └── Specific: Session forensics (debug one request)

Quick Reference — CLI Commands

# Real-time log tail
aws logs tail "/aws/bedrock-agentcore/runtimes/AGENT_ID-DEFAULT" --follow

# Filter for specific session
aws logs tail "/aws/bedrock-agentcore/runtimes/AGENT_ID-DEFAULT" \
  --filter-pattern "SESSION_ID" --since 1h

# Filter for errors only
aws logs tail "/aws/bedrock-agentcore/runtimes/AGENT_ID-DEFAULT" \
  --filter-pattern "Error" --since 1h