The 95% Rule: Why Your Agent Is Slow and How to Prove It
12th March 2026
Your agent takes 5 seconds to respond. Where did those 5 seconds go? AgentCore gives you 6 observability layers, 30 hidden metrics, and a debugging decision tree — but you have to know where to look. Here’s everything you can’t see by just reading the code.
The 6 Layers of Observability
AgentCore gives you 6 distinct observability layers, each revealing different things:
Layer 1: CLIENT-SIDE TIMING
You measure this yourself (time.time() around invoke_agent_runtime)
Shows: Total end-to-end latency including network
Blind spot: Can't see what's happening inside
Layer 2: RUNTIME LOGS (CloudWatch Logs → [runtime-logs] streams)
Your print() statements + bedrock_agentcore framework logs
Shows: Request arrival, tool calls, completion time, errors
Blind spot: No per-component breakdown
Layer 3: OTEL TRACE EVENTS (CloudWatch Logs → otel-rt-logs stream)
Every message in the LLM conversation
Shows: System prompt, user input, LLM response, tool calls, tool results
Blind spot: No timing (just message content)
Layer 4: OTEL EMF METRICS (CloudWatch Logs → otel-rt stream)
Embedded Metric Format — auto-extracted into CloudWatch Metrics
Shows: Per-request LLM duration, tool duration, token counts, TTFT
Blind spot: Aggregated per-request (no per-message timing)
Layer 5: AWS/Bedrock-AgentCore METRICS (CloudWatch Metrics namespace)
AWS-measured metrics from OUTSIDE the microVM
Shows: End-to-end latency with percentiles, errors, throttles, billing
Blind spot: No inside-the-VM breakdown
Layer 6: CLOUDWATCH LOGS INSIGHTS (query engine)
SQL-like queries across all log streams
Shows: Aggregations, patterns, statistics across all invocations
Blind spot: Query syntax is limited, 5-second minimum delay
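Layer 1 is the only layer you implement yourself. A minimal sketch — the `bedrock-agentcore` client and the `invoke_agent_runtime` parameter names match the data-plane API, but treat them as assumptions for your SDK version:

```python
import time


def timed_invoke(client, agent_arn, session_id, payload):
    """Layer 1: total end-to-end latency, including network and the
    sidecar hop that no server-side metric can see."""
    start = time.perf_counter()
    response = client.invoke_agent_runtime(
        agentRuntimeArn=agent_arn,
        runtimeSessionId=session_id,
        payload=payload,
    )
    # Reading the streamed body to the end is part of what the user waits for.
    body = response["response"].read()
    elapsed = time.perf_counter() - start
    return body, elapsed


# Usage sketch (client = boto3.client("bedrock-agentcore")):
# body, secs = timed_invoke(client, AGENT_ARN, SESSION_ID,
#                           b'{"prompt": "What is 2+2?"}')
```

Subtract the EMF `http.server.duration` metric (Layer 4) from `elapsed` and you get the sidecar tax discussed below.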
The Sidecar Tax — The Time You Can Never See
Every request passes through the sidecar (port 9000) before reaching your code (port 8080). The sidecar adds 50-200ms for TLS termination, auth token validation, session ID → microVM routing lookup, request serialization, and HTTP forwarding to :8080.
Sidecar tax = Client total time - http.server.duration (EMF metric)
For our test: 5.544s (client) - 4.615s (http.server.duration) = 0.929s sidecar + network
On cold starts, this includes Firecracker microVM boot (125ms) + Python startup + your imports.
Two Log Streams, Completely Different Data
Log Group: /aws/bedrock-agentcore/runtimes/{AGENT_ID}-DEFAULT
│
├── 2026/03/12/[runtime-logs]ed8b8c65-... ← MicroVM instance #1
├── 2026/03/12/[runtime-logs]375e9614-... ← MicroVM instance #2
├── 2026/03/12/[runtime-logs]212edc45-... ← MicroVM instance #3
│ ... (one stream per microVM that ever existed)
│
└── otel-rt-logs ← ALL OTel data (shared stream)
The UUID in [runtime-logs]<uuid> IS the Firecracker microVM instance ID. If you see the same UUID handling multiple requests, those requests hit the same warm microVM (sticky session working). If you see different UUIDs, those were different microVMs (cold starts or load balancing).
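You can turn this into a fleet census. A sketch using `describe_log_streams` (the stream-name pattern is taken from the listing above; boto3 is imported lazily so the parsing helper stays dependency-free):

```python
import re

RUNTIME_STREAM = re.compile(r"\[runtime-logs\]([0-9a-f][0-9a-f-]*)$")


def microvm_id(stream_name):
    """Extract the Firecracker microVM UUID from a [runtime-logs] stream
    name, or None for non-runtime streams like otel-rt-logs."""
    match = RUNTIME_STREAM.search(stream_name)
    return match.group(1) if match else None


def distinct_microvms(log_group):
    """Count every microVM that ever wrote logs: more distinct IDs in a
    window means more cold starts (or wider load balancing)."""
    import boto3
    logs = boto3.client("logs")
    vms = set()
    for page in logs.get_paginator("describe_log_streams").paginate(
        logGroupName=log_group
    ):
        for stream in page["logStreams"]:
            vm = microvm_id(stream["logStreamName"])
            if vm:
                vms.add(vm)
    return vms
```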
Embedded Metric Format (EMF) — Metrics Without put_metric_data
OTel logs contain _aws.CloudWatchMetrics JSON blocks. CloudWatch automatically extracts these into metrics without you calling put_metric_data():
{
"_aws": {
"Timestamp": 1773335423274,
"CloudWatchMetrics": [{
"Namespace": "bedrock-agentcore",
"Metrics": [{"Name": "strands.tool.duration", "Unit": "Seconds"}],
"Dimensions": [["tool_name", "tool_use_id"]]
}]
},
"strands.tool.duration": {"Values": [0.003], "Counts": [1]},
"tool_name": "calculator",
"tool_use_id": "tooluse_vEjG3idNjMdOhbBd3peHaL"
}
The OTel collector on port 8000 inside the microVM receives traces from opentelemetry-instrument, converts them to EMF, and writes them to CloudWatch Logs. CloudWatch then auto-extracts the metrics.
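You can emit the same shape yourself: any log event that is a valid EMF JSON document gets its metrics extracted on ingestion. A minimal builder — the namespace and metric names here are yours to choose, and whether your runtime's stdout path triggers extraction is worth verifying:

```python
import json
import time


def emf_record(namespace, metric, value, unit, dimensions):
    """Build an EMF log line shaped like the example above. Written to a
    CloudWatch log stream, the metric is extracted automatically -- no
    put_metric_data() call."""
    return json.dumps({
        "_aws": {
            "Timestamp": int(time.time() * 1000),
            "CloudWatchMetrics": [{
                "Namespace": namespace,
                "Metrics": [{"Name": metric, "Unit": unit}],
                "Dimensions": [sorted(dimensions)],  # one dimension set
            }],
        },
        metric: value,
        **dimensions,
    })


# print(emf_record("my-agent", "tool.duration", 0.003, "Seconds",
#                  {"tool_name": "calculator"}))
```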
Trace ID = Your Request’s DNA
Every OTel event has a traceId field. All events from the same invoke_agent_runtime() call share the same traceId. The spanId changes per operation:
traceId: 69b2f37963a139ff1d6114ea6b800056 (one per request)
├── spanId: f9aff898 → gen_ai.system.message (LLM call #1 start)
├── spanId: f9aff898 → gen_ai.user.message
├── spanId: f9aff898 → gen_ai.choice (tool_use)
├── spanId: 91da9ac0 → strands.telemetry.tracer (cycle #1 end)
├── spanId: 8cba081d → strands.telemetry.tracer (tool result)
├── spanId: 57695a94 → gen_ai.system.message (LLM call #2 start)
├── spanId: 57695a94 → gen_ai.choice (end_turn)
├── spanId: f2f863700 → strands.telemetry.tracer (cycle #2 end)
├── spanId: ee60336f → strands.telemetry.tracer (agent complete)
└── spanId: 9f9f5122 → bedrock_agentcore.app "Invocation completed (4.613s)"
To debug a specific slow request: grep for the session_id in OTel logs, get the traceId, then filter ALL OTel events by that traceId.
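The second step of that recipe can be automated with `filter_log_events`. A sketch (the `logs` parameter exists so it can be exercised without AWS; omit it in real use):

```python
def events_for_trace(log_group, trace_id, start_ms, end_ms, logs=None):
    """Pull every OTel event sharing one traceId, in timestamp order,
    from the shared otel-rt-logs stream."""
    if logs is None:
        import boto3
        logs = boto3.client("logs")
    events = []
    for page in logs.get_paginator("filter_log_events").paginate(
        logGroupName=log_group,
        logStreamNames=["otel-rt-logs"],
        startTime=start_ms,
        endTime=end_ms,
        filterPattern=f'"{trace_id}"',  # substring match on the raw message
    ):
        events.extend(page["events"])
    return sorted(events, key=lambda e: e["timestamp"])
```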
The Event Loop Is The Agent’s Heartbeat
User prompt arrives
│
▼
┌─ CYCLE 1 ─────────────────────────────────┐
│ 1. Build messages (system prompt + input) │
│ 2. Call LLM (Bedrock) │ ← Most time here
│ 3. LLM returns: tool_use or end_turn │
│ 4. If tool_use: execute tool │ ← Second most time
│ 5. Append tool result to messages │
└─────────────────────────────────────────────┘
│ (if tool_use, loop back)
▼
┌─ CYCLE 2 ─────────────────────────────────┐
│ 1. Build messages (now includes cycle 1) │
│ 2. Call LLM again │ ← Context now LARGER
│ 3. LLM returns: end_turn │
└────────────────────────────────────────────┘
│
▼
Return response to user
Each cycle is measured by strands.event_loop.cycle_count and strands.event_loop.latency.
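The loop above can be sketched in a few lines. This is the shape of a Strands-style event loop, not its real implementation — the message format and stop reasons are simplified:

```python
def run_agent(llm, tools, messages, max_cycles=10):
    """Minimal event-loop sketch: call the LLM, execute tools, repeat
    until end_turn. The context (messages) grows every cycle."""
    for cycle in range(1, max_cycles + 1):
        reply = llm(messages)                      # most time is spent here
        messages.append({"role": "assistant", "content": reply})
        if reply["stop_reason"] == "end_turn":
            return reply["text"], cycle
        # tool_use: run the tool (second most time), feed result back in
        result = tools[reply["tool"]](reply["input"])
        messages.append({"role": "user", "content": {"tool_result": result}})
    raise RuntimeError("max_cycles exceeded -- likely a tool-use loop")
```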
Token Growth — The Silent Performance Killer
Simple request: "What is 2+2?" (2 cycles)
Cycle 1: 1752 input tokens → 44 output (tool_use) = 2.916s
Cycle 2: 1822 input tokens → 54 output (final text) = 1.489s
Token growth: +70 tokens (+4%)
Complex request: "15*37, add 42, tell me the time" (2 cycles)
Cycle 1: 1771 input tokens → 100 output (tool_use) = 3.072s
Cycle 2: 1952 input tokens → 117 output (final text) = 3.451s
Token growth: +181 tokens (+10%)
Why it matters: if your agent runs 10 cycles, input tokens grow with every cycle. Cycle 1 might process ~1,750 tokens in ~1.5s, but cycle 10 processes ~5,000 tokens in ~4.0s. Ten cycles with growing latency = ~30 seconds just for LLM calls. This is the #1 cause of “my agent takes 2 minutes.”
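The growth compounds, so it is worth projecting. A back-of-envelope sketch — the seconds-per-1k-tokens rate is an illustrative assumption, to be calibrated from your own `gen_ai.client.operation.duration` data:

```python
def projected_llm_seconds(base_tokens, growth_per_cycle, cycles,
                          secs_per_1k_tokens=0.8):
    """Rough cumulative LLM time when the input context grows every
    cycle. secs_per_1k_tokens is a made-up illustrative constant."""
    total, tokens = 0.0, base_tokens
    for _ in range(cycles):
        total += tokens / 1000 * secs_per_1k_tokens
        tokens += growth_per_cycle
    return total


# Ten cycles growing from ~1,750 toward ~5,000 input tokens:
# projected_llm_seconds(1750, 360, 10) -> ~27 s of pure LLM time
```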
Time-to-First-Token (TTFT) vs Total Duration
gen_ai.client.operation.duration = TTFT + token streaming
(TTFT: the model thinking before the first token; streaming: generating the rest of the output)
For tool_use responses (short):
TTFT: 2707ms → Total: 2916ms → Streaming: 209ms (7%)
For text responses (longer):
TTFT: 2662ms → Total: 3455ms → Streaming: 793ms (23%)
High TTFT means the model is thinking longer. Causes: large input context, complex reasoning required, model overloaded (try different region), or using a larger model (Opus > Sonnet > Haiku).
The 95% Rule
Real measurement from a “What is 2+2?” request:
Total inside microVM: 4,613ms (100%)
├── LLM call #1: 2,916ms (63%)
├── LLM call #2: 1,489ms (32%)
├── Tool (calculator): 3ms (0.07%)
└── Overhead: 205ms (4.4%)
LLM TOTAL: 4,405ms (95.5%)
95% of time is LLM inference. This is typical for agents with fast tools. The implication:
- Optimizing your tool code = marginal gains
- Switching from Sonnet to Haiku = 2-5x improvement
- Reducing input tokens by 50% = ~30% improvement
- Reducing cycles from 5 to 2 = ~60% improvement
Tool Duration Reveals External Dependencies
calculator: 0.003s (pure computation — instant)
weather: 0.100s (HTTP call to weather API)
database: 1.200s (connection + query + serialization)
If you see a tool taking > 1 second, it’s calling an external service. Fix: connection pooling, caching, timeouts, parallel execution.
Two Namespaces, Two Perspectives
AWS/Bedrock-AgentCore (AWS-side — outside the microVM)
├── Latency ← End-to-end including sidecar (what your user feels)
├── Invocations ← Count of invoke_agent_runtime() calls
├── Sessions ← Count of NEW sessions created
├── Errors ← SystemErrors + UserErrors
├── Throttles ← Rate limit exceeded
├── CPUUsed-vCPUHours ← BILLING: CPU usage
└── MemoryUsed-GBHours ← BILLING: Memory usage
bedrock-agentcore (OTel — inside the microVM)
├── http.server.duration ← Time inside your code
├── gen_ai.client.token.usage ← Token counts
├── strands.event_loop.* ← Event loop metrics
├── strands.tool.* ← Tool metrics
└── strands.model.time_to_first_token ← LLM thinking time
Percentiles Tell the Real Story
From a 58-invocation test batch:
Latency (AWS/Bedrock-AgentCore namespace):
avg: 3,082ms
min: 1,305ms
max: 14,849ms ← 5x slower than average!
p50: 2,436ms ← Typical request
p90: 3,770ms ← 90% of requests finish by here
p99: 14,220ms ← Worst 1% — likely cold starts
p50 (2.4s) vs p99 (14.2s) = 6x difference. The p99 outlier is almost certainly a cold start. If you only look at averages, you miss this entirely.
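Percentiles are one `get_metric_statistics` call away. A sketch — the dimension names for your runtime are an assumption, so check the metric in the console first; the `cw` parameter exists so the function can be exercised without AWS:

```python
from datetime import datetime, timedelta, timezone


def latency_percentiles(dimensions, hours=24, cw=None):
    """Fetch Latency percentiles from the AWS-side namespace. Averages
    hide the cold-start tail; percentiles surface it."""
    if cw is None:
        import boto3
        cw = boto3.client("cloudwatch")
    end = datetime.now(timezone.utc)
    return cw.get_metric_statistics(
        Namespace="AWS/Bedrock-AgentCore",
        MetricName="Latency",
        Dimensions=dimensions,
        StartTime=end - timedelta(hours=hours),
        EndTime=end,
        Period=3600 * hours,
        ExtendedStatistics=["p50", "p90", "p99"],
    )["Datapoints"]
```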
Sessions vs Invocations
58 invocations but only 31 new sessions. That means 27 requests (47%) hit existing warm sessions — proving sticky routing works. The lower your Sessions/Invocations ratio, the more you’re benefiting from warm microVMs.
Throttles, Errors, and Hidden Retries
SystemErrors → AWS infrastructure issue. Nothing you can do. Wait and retry.
UserErrors → Your @app.entrypoint threw an exception. Check runtime logs.
Throttles → You hit a rate limit. Request increase via Service Quotas.
Hidden: the boto3 client has built-in retry with exponential backoff. A single throttle from AWS may result in 2-3 actual API calls before succeeding. Your client-side timing includes retry time, but the AWS Latency metric only counts the final successful attempt.
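You can make those hidden retries bounded and explicit via `botocore.config.Config`. A sketch (client construction is deferred into a function so the policy itself stays inspectable):

```python
def retry_settings(total_attempts=3):
    """Bound the hidden retries: one throttled invoke becomes at most
    total_attempts real API calls, and adaptive mode rate-limits
    client-side before AWS has to throttle you."""
    return {"total_max_attempts": total_attempts, "mode": "adaptive"}


def make_client(region="us-east-1"):
    import boto3
    from botocore.config import Config
    return boto3.client("bedrock-agentcore", region_name=region,
                        config=Config(retries=retry_settings()))
```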
CPU and Memory Billing — Real Numbers
CPUUsed-vCPUHours: 0.004628 → $0.000414 (@ $0.0895/vCPU-hr)
MemoryUsed-GBHours: 0.007257 → $0.000069 (@ $0.00945/GB-hr)
Key points:
CPU is charged only when your code is executing (not idle)
Memory is charged for 128MB minimum
Idle sessions = $0 (confirmed in 10-minute idle test)
Billing is per-second granular, aggregated to hourly metrics
One Log Stream = One MicroVM’s Entire Life
2026/03/12/[runtime-logs]ed8b8c65-d8ce-4287-a67d-8d464523db53
This stream contains ALL logs from microVM ed8b8c65 from boot to termination. If this VM handled 10 requests, all 10 appear in this stream. When the VM is terminated (idle timeout or explicit stop), no more logs appear.
Forensic trick — count how many requests a specific microVM handled:
aws logs filter-log-events \
--log-group-name "/aws/bedrock-agentcore/runtimes/AGENT_ID-DEFAULT" \
--log-stream-names "2026/03/12/[runtime-logs]ed8b8c65-..." \
--filter-pattern "Invocation completed"
CloudWatch Logs Insights — The Power Queries
Duration percentiles across all invocations:
fields @timestamp, @message
| filter @message like /Invocation completed/
| parse @message '"message": "Invocation completed successfully (*s)"' as duration
| stats count() as n,
avg(duration) as avg_s,
pct(duration, 50) as p50,
pct(duration, 90) as p90,
pct(duration, 99) as p99,
max(duration) as max_s
Tool usage frequency:
fields @message
| filter @message like /^Tool #/
| parse @message 'Tool #*: *' as num, tool_name
| stats count() as calls by tool_name
| sort calls desc
Cold starts per hour:
fields @timestamp
| filter @message like /Connection failed out to container health check/
| stats count() as cold_starts by bin(1h)
| sort cold_starts desc
Slowest sessions:
fields @timestamp, @message
| filter @message like /Invocation completed/
| parse @message '"message": "Invocation completed successfully (*s)", "logger": "*", "requestId": "*", "sessionId": "*"' as duration, logger, req_id, session_id
| sort duration desc
| limit 10
The “Connection failed out to container health check” Message
This appears exactly once per microVM cold boot. It’s the sidecar’s first TCP probe hitting the microVM before Uvicorn is fully listening. The sidecar retries with a proper GET /ping and succeeds.
Counting these messages = counting cold starts. If you see 50 of these in an hour, you had 50 cold microVM boots.
The otel-rt-logs Shared Stream Problem
All microVM instances write to the same otel-rt-logs stream. Events from different requests are interleaved. You MUST filter by traceId or session_id to isolate one request.
fields @timestamp, @message
| filter @logStream = "otel-rt-logs"
| filter @message like /"session.id":"YOUR_SESSION_ID"/
| sort @timestamp asc
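Any of these queries can also be driven from code via `start_query`/`get_query_results`. A sketch (the `logs` parameter exists so the polling loop can be exercised without AWS):

```python
import time


def run_insights_query(log_group, query, start_ts, end_ts, logs=None):
    """Run a Logs Insights query and poll until it finishes. Remember the
    ~5 s ingestion delay: the newest events may not be queryable yet."""
    if logs is None:
        import boto3
        logs = boto3.client("logs")
    query_id = logs.start_query(
        logGroupName=log_group,
        startTime=start_ts,   # epoch seconds
        endTime=end_ts,
        queryString=query,
    )["queryId"]
    while True:
        resp = logs.get_query_results(queryId=query_id)
        if resp["status"] in ("Complete", "Failed", "Cancelled"):
            return resp
        time.sleep(1)
```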
Three Processes, Three Ports
PID 1: Sidecar (AWS-injected) → :9000 (receives from NLB)
PID 2: OTel Collector (ADOT) → :8000 (receives traces from your app)
PID 3: opentelemetry-instrument python → :8080 (YOUR app via Uvicorn)
└── Uvicorn → Starlette (BedrockAgentCoreApp)
├── POST /invocations → your @app.entrypoint
├── GET /ping → health check
└── WS /ws → websocket (unused in REST mode)
The opentelemetry-instrument wrapper automatically instruments all boto3 calls, captures LLM request/response messages, measures tool execution time, counts event loop cycles, and sends everything to the OTel collector on :8000.
The http.server_name Reveals AWS Internals
http.server_name: cell01.us-east-1.prod.arp.kepler-analytics.aws.dev
cell01 — The specific compute cell running your microVM
us-east-1 — AWS region
prod — Production environment
arp — Agent Runtime Platform (internal codename)
kepler-analytics — Project Kepler (AgentCore's internal name)
.aws.dev — AWS internal domain
This tells you which physical compute cell your microVM landed on. If one cell is consistently slower, it could indicate noisy-neighbor issues.
The Invalid HTTP Request Warning
WARNING: Invalid HTTP request received.
This appears on every cold start. The sidecar opens a raw TCP connection to probe whether the port is accepting connections before it sends a proper HTTP GET /ping; Uvicorn accepts the connection but can’t parse the probe as HTTP. It’s harmless — the sidecar immediately retries with a valid HTTP request. But if you see many of these in sequence (10+), the microVM is taking unusually long to boot.
The OTel Collector Can Crash Silently
The ADOT collector on :8000 is a separate process. If it crashes or runs out of memory:
- Your agent still works (requests succeed)
- You lose all metrics (no EMF, no traces)
- CloudWatch shows gaps in the bedrock-agentcore namespace
- The AWS/Bedrock-AgentCore namespace still works (measured outside)
How to detect: if you see invocations in the AWS/Bedrock-AgentCore namespace but NO corresponding events in otel-rt-logs, the OTel collector died.
Cost Per Request — Real Numbers
Simple "What is 2+2?" request:
HAIKU MODEL:
LLM input: 3,574 tokens × $0.25/MTok = $0.000894
LLM output: 63 tokens × $1.25/MTok = $0.000079
Compute CPU: 4.6s × $0.0895/vCPU-hr = $0.000114
Compute Mem: 4.6s × $0.00945/GB-hr × 0.128GB = $0.000002
─────────────────────────────────────────────
TOTAL: $0.001089 (~$1.09/1000 requests)
SONNET MODEL:
LLM input: 3,574 tokens × $3/MTok = $0.010722
LLM output: 63 tokens × $15/MTok = $0.000945
Compute (same): $0.000116
─────────────────────────────────────────────
TOTAL: $0.011783 (~$11.78/1000 requests)
Idle sessions truly cost $0. The microVM stays in memory (reserved by Firecracker) but CPU is suspended. AWS only bills when your code is actively executing.
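The two tables above reduce to one formula. A sketch — the 1-vCPU figure is an assumption for illustration (read real usage from CPUUsed-vCPUHours), prices are the list prices quoted above:

```python
def request_cost(in_tokens, out_tokens, active_seconds,
                 price_in_mtok, price_out_mtok,
                 vcpus=1.0, mem_gb=0.128,
                 cpu_hr=0.0895, mem_hr=0.00945):
    """Per-request cost: LLM tokens ($/MTok) plus active compute time
    at the AgentCore vCPU-hour and GB-hour rates."""
    llm = in_tokens / 1e6 * price_in_mtok + out_tokens / 1e6 * price_out_mtok
    compute = active_seconds / 3600 * (vcpus * cpu_hr + mem_gb * mem_hr)
    return llm + compute


haiku = request_cost(3574, 63, 4.6, 0.25, 1.25)   # ~ $0.00109
sonnet = request_cost(3574, 63, 4.6, 3.0, 15.0)   # ~ $0.01178
```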
Cold vs Warm vs Sticky — Real Production Numbers
From a 94-invocation benchmark:
┌──────────────────────────────────────────────────┐
│ Type Avg Min Max p50 p99 │
│ ────── ───── ───── ────── ───── ────── │
│ COLD 3.406s 2.165s 14.849s 2.436s 14.220s│
│ WARM 2.797s 1.695s 3.769s 2.563s 3.379s │
│ STICKY 2.532s 1.305s 3.378s 2.435s 3.370s │
└──────────────────────────────────────────────────┘
p99: Cold 14.2s vs Warm 3.4s = 4x improvement!
The average improvement (18-26%) understates the real benefit. The p99 improvement (4x) matters more because those are the cold-start outliers that users actually feel.
Concurrent Scaling Behavior
5 concurrent cold starts:
All 5 complete in 3.166s wall clock
Individual: 2.1s - 3.2s range
5 concurrent warm hits:
All 5 complete in 2.746s wall clock
Individual: 1.7s - 2.7s range
AgentCore provisions microVMs in parallel. 5 simultaneous cold starts don’t serialize — they all boot at once. The wall clock time ≈ slowest individual request, not sum of all requests.
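This is easy to verify yourself with a thread pool. A sketch — `invoke` is any callable that performs one invocation (e.g. a closure over the client and a fresh session ID per index):

```python
import time
from concurrent.futures import ThreadPoolExecutor


def bench_concurrent(invoke, n=5):
    """Fire n invocations at once. If provisioning serialized, wall-clock
    time would approach the sum of individual times; in practice it
    tracks the slowest single request."""
    def timed(i):
        t0 = time.perf_counter()
        invoke(i)
        return time.perf_counter() - t0

    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n) as pool:
        individual = list(pool.map(timed, range(n)))
    wall = time.perf_counter() - start
    return wall, individual
```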
The Complete Request Flow (Annotated Timeline)
T+0.000s Your code: invoke_agent_runtime()
T+0.050s boto3 serializes request, signs with SigV4
T+0.100s HTTPS to bedrock-agentcore.us-east-1.amazonaws.com
T+0.150s AWS API Gateway receives request
T+0.200s AgentCore control plane: session_id → microVM lookup
IF new session: provision new Firecracker microVM (125ms)
IF existing: route to existing microVM
T+0.300s NLB forwards to sidecar on :9000
T+0.350s Sidecar validates auth, injects headers
T+0.400s Sidecar forwards to :8080/invocations
T+0.450s BedrockAgentCoreApp._handle_invocation() called
T+0.460s Your @app.entrypoint function starts
T+0.470s Strands Agent event loop begins
T+0.480s ── CYCLE 1 ──
T+0.490s Build messages: system prompt + user input (1752 tokens)
T+0.500s Call bedrock-runtime/model/invoke
T+0.510s TTFT: model thinking... (2707ms)
T+3.207s Model returns: tool_use(calculator, "2+2")
T+3.210s Execute tool: calculator("2+2") → "Result: 4" (3ms)
T+3.213s ── CYCLE 2 ──
T+3.220s Build messages: system + user + assistant + tool_result (1822 tokens)
T+3.230s Call bedrock-runtime again
T+3.240s TTFT: model thinking... (1395ms)
T+4.635s Model returns: end_turn "The answer is **4**."
T+4.640s Event loop ends
T+4.650s BedrockAgentCoreApp sends response
T+4.660s Sidecar forwards response upstream
T+4.700s OTel collector batches metrics, writes EMF to CloudWatch Logs
T+4.800s Response reaches your boto3 client
T+5.544s Your code finishes reading response stream
The Debugging Decision Tree
Agent is slow
├── WHERE is time spent?
│ ├── LLM inference (> 80%) ──── THE MOST COMMON CASE
│ │ ├── Too many cycles? (> 3)
│ │ │ ├── Simplify system prompt
│ │ │ ├── Remove unnecessary tools
│ │ │ └── Add "answer directly when possible" instruction
│ │ ├── Too many input tokens?
│ │ │ └── Use shorter tool responses
│ │ ├── Model too slow?
│ │ │ ├── Switch Opus → Sonnet → Haiku
│ │ │ ├── Try different AWS region
│ │ │ └── Use streaming for perceived speed
│ │ └── High TTFT? (> 3s)
│ │ ├── Model overloaded (try off-peak hours)
│ │ └── Too many tools registered (each adds ~100 tokens)
│ │
│ ├── Tool execution (> 20%)
│ │ ├── Which tool? (check strands.tool.duration)
│ │ ├── External API slow → connection pooling, caching
│ │ ├── Database slow → connection reuse, indexing
│ │ └── No timeout → add timeout (default 30s)
│ │
│ └── Cold start (first request only)
│ ├── Large Docker image → minimize image
│ ├── Heavy imports → lazy loading
│ ├── Model initialization → cache model objects
│ └── Pre-warm with warm pools
│
├── PATTERN?
│ ├── First request slow, rest fast → cold start
│ ├── Getting slower over time → token growth per cycle
│ └── All requests slow → check model, check region
│
└── HOW to investigate?
├── Quick: aws logs tail --follow (real-time)
├── Deep: OTel EMF metrics (per-component breakdown)
├── Historical: Logs Insights queries (aggregations)
├── Visual: CloudWatch GenAI dashboard (UI)
└── Specific: Session forensics (debug one request)
Quick Reference — CLI Commands
# Real-time log tail
aws logs tail "/aws/bedrock-agentcore/runtimes/AGENT_ID-DEFAULT" --follow
# Filter for specific session
aws logs tail "/aws/bedrock-agentcore/runtimes/AGENT_ID-DEFAULT" \
--filter-pattern "SESSION_ID" --since 1h
# Filter for errors only
aws logs tail "/aws/bedrock-agentcore/runtimes/AGENT_ID-DEFAULT" \
--filter-pattern "Error" --since 1h