Monday, 16th February 2026
What I Learned Building a Streaming Agent on AWS Bedrock AgentCore Runtime
I spent a week building a conversational agent on AWS Bedrock AgentCore Runtime. It supports hybrid memory (fast in-session + persistent cross-session), real-time token streaming, multi-user isolation, and a 12-test automated suite. Here’s everything I wish someone had told me before I started.
The full source code is available on GitHub: github.com/avparkhi/AWS-BedrockAgentCore-Testing
1. The MicroVM Mental Model
AgentCore doesn’t run your agent in a container you manage. It runs inside microVMs — tiny, isolated virtual machines that spin up per-session. The first request to a new session triggers a cold start (~1.5–10s depending on what you initialize). Subsequent requests to the same session hit a warm microVM where all your Python globals are still in memory (~180ms overhead).
This means you can cache expensive things — model clients, config, database connections — as module-level globals, and they’ll persist across warm invocations. But the moment a user creates a new session, a new microVM spins up and everything resets.
Think of it like AWS Lambda, but with a session-sticky routing layer on top.
2. Memory is a Two-Tier Problem
The session-sticky microVM model creates an interesting memory challenge:
- Within a session (warm starts): You want fast, in-process memory. A simple Python dictionary works — zero latency, no API calls. This is your RAM layer.
- Across sessions (cold starts, redeploys, crashes): You need persistent storage. AgentCore provides a Memory API for this — durable, survives everything, but adds ~300ms per read.
The hybrid approach: always save to durable memory (background, on every message). On warm starts, use RAM. On cold starts, load from durable memory. The user never notices the difference.
3. Streaming Changes the User Experience Dramatically
Without streaming, a typical warm chat response takes ~5.4 seconds. The user sees nothing until the entire response is ready. With streaming, the first token appears in ~1–1.5 seconds. The total time is the same, but the perceived latency drops by 75%.
| Metric | Non-Streaming | Streaming |
|---|---|---|
| Time to first visible output | ~5.4s | ~1–1.5s |
| Total response time | ~5.4s | ~5.4s |
| User perception | “Is it frozen?” | “It’s thinking and typing” |
If your agent takes more than 2 seconds to respond, streaming isn’t a nice-to-have — it’s table stakes for user experience.
4. The SDK Has Opinions About Streaming (And They’re Inconvenient)
The AgentCore SDK supports streaming on the server side beautifully — if your entrypoint returns a Python generator, the SDK automatically detects it and wraps the response as Server-Sent Events (SSE). No configuration needed.
The problem is on the client side. The SDK’s built-in invoke method handles streaming by printing chunks directly to the console. It then returns an empty dictionary. There is no programmatic access to the streamed content.
If you want to actually parse streaming responses in your own client — show metadata, measure time-to-first-token, display formatted output — you need to bypass the SDK and call the API directly via boto3. It’s not hard, but it’s undocumented and unexpected.
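For reference, here is a small SSE parser plus a hedged sketch of the direct boto3 call. The parser assumes the standard `data: ...` SSE framing with one JSON-encoded chunk per event; the client and parameter names in the commented sketch reflect my setup and should be checked against your botocore version:

```python
import json


def parse_sse_lines(lines):
    """Yield decoded payloads from raw SSE lines (b"data: ..." framing)."""
    for raw in lines:
        line = raw.decode("utf-8").strip() if isinstance(raw, bytes) else raw.strip()
        if line.startswith("data:"):
            yield json.loads(line[len("data:"):].strip())


# Hedged usage sketch -- bypassing the SDK's invoke and calling the runtime
# directly via boto3 (names are assumptions; verify against your service model):
#
#   client = boto3.client("bedrock-agentcore")
#   resp = client.invoke_agent_runtime(
#       agentRuntimeArn=AGENT_ARN,
#       runtimeSessionId=session_id,
#       payload=json.dumps({"prompt": "hello"}),
#   )
#   for chunk in parse_sse_lines(resp["response"].iter_lines()):
#       print(chunk, end="", flush=True)
```

With programmatic access to each chunk you can timestamp the first one for time-to-first-token, rather than watching the SDK print to your console.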
5. The Five Gotchas
These are the things that cost me the most time and are not documented anywhere I could find.
Gotcha #1: The Payload Isn’t a Real Dict
The payload your entrypoint receives looks like a Python dictionary, but it’s actually a JSONSerializableDict. It behaves differently in subtle ways:
- The two-argument form of `.get(key, default)` throws a TypeError. Only the single-argument form works.
- Bracket access (`payload["key"]`) throws a TypeError — no subscript support.
- Assignment (`payload["key"] = value`) also fails.
The workaround is to use single-argument .get() with an or fallback for defaults. It’s ugly, but it works reliably.
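In code, the workaround is a one-liner, with the caveat that `or` also swallows falsy values such as `""`, `0`, and `False`:

```python
def read_field(payload, key, default):
    # Single-argument .get() works on JSONSerializableDict; the two-argument
    # form and payload[key] both raise TypeError on the AgentCore payload.
    return payload.get(key) or default


mode = read_field({"mode": "chat"}, "mode", "ping")  # "chat"
user = read_field({}, "user_id", "anonymous")        # "anonymous"
```

If a field can legitimately be falsy, compare against `None` explicitly instead of relying on `or`.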
The same limitation applies to agent state objects in the Strands SDK. Don’t try to use them as dicts — store your state elsewhere.
Gotcha #2: Return Values Get Double-Encoded
The SDK JSON-encodes whatever your entrypoint returns. If you return a JSON string, the client receives a JSON-encoded JSON string — with escaped quotes, escaped newlines, the works. You end up needing to iteratively decode the response (sometimes 3–4 rounds) to get the original data.
I ended up using a custom tag format for metadata instead of JSON, just to avoid the encoding nightmare.
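If you do have to live with double-encoded JSON, a defensive decode helper looks like this (the round limit is arbitrary):

```python
import json


def undouble(value, max_rounds=6):
    """Repeatedly json.loads a value while it is still a JSON string.

    Defensive helper for responses the SDK may have encoded more than once.
    """
    for _ in range(max_rounds):
        if not isinstance(value, str):
            return value
        try:
            value = json.loads(value)
        except (ValueError, TypeError):
            return value  # plain string, not JSON: stop peeling
    return value
```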
Gotcha #3: Streaming Generators Have the Same Encoding Trap
This was the most frustrating bug. When your streaming generator yields values, the SDK JSON-encodes each one for the SSE wire format. If you pre-encode your dicts to JSON strings before yielding, the SDK encodes them again. The client receives a quoted string instead of a parseable dict.
The symptom: metadata objects appear as raw text inline with the chat response, and all metadata fields show as “unknown.”
The fix: yield raw Python objects (dicts, strings) and let the SDK handle serialization. Never call json.dumps() on values you’re about to yield from a streaming generator.
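To make the failure mode concrete, here is a toy reproduction. `simulate_sdk_wire` is my assumption about what the SDK does per chunk, based on the observed symptom:

```python
import json


def bad_stream():
    # Wrong: pre-encoding before yield gets encoded AGAIN by the SDK,
    # so the client receives a quoted string instead of an object.
    yield json.dumps({"type": "meta", "model": "example"})


def good_stream():
    # Right: yield raw Python objects; the SDK serializes exactly once.
    yield {"type": "meta", "model": "example"}
    yield "Hello, "
    yield "world."


def simulate_sdk_wire(gen):
    """My model of the SSE wire format: each yielded value JSON-encoded once."""
    return [json.dumps(chunk) for chunk in gen]
```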
Gotcha #4: Non-Streaming Responses Silently Truncate at 1024 Bytes
There is no error. There is no warning. If your non-streaming response exceeds 1024 bytes, it gets cut off mid-JSON. You spend an hour debugging your JSON parser before realizing the data was simply incomplete.
Keep non-streaming responses compact. For my memory query mode, I had to truncate each conversation turn to ~80 characters to stay under the limit.
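A simple guard I'd use to stay under the limit (the 80-character and 1024-byte numbers come from the observations above, not from any documented contract):

```python
def compact_turns(turns, per_turn=80, budget=1024):
    """Trim each turn and stop before the joined response would exceed
    the (observed, undocumented) 1024-byte non-streaming limit."""
    out, used = [], 0
    for turn in turns:
        snippet = turn[:per_turn]
        cost = len(snippet.encode("utf-8")) + 1  # +1 for the newline separator
        if used + cost > budget:
            break
        out.append(snippet)
        used += cost
    return "\n".join(out)
```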
Gotcha #5: A Generator Function Can’t Conditionally Stream
In Python, a function that contains yield anywhere in its body always returns a generator — you can’t conditionally return a string vs. yield chunks from the same function.
The solution is a dispatch pattern: your entrypoint is a regular function that returns either a generator object (for streaming) or a string (for non-streaming). The SDK inspects the return value’s type, not the function itself. Two separate handler functions, one thin dispatcher.
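Sketched out, the dispatch looks like this (handler names and the payload shape are mine, not the SDK's):

```python
def handle_streaming(prompt):
    # Generator function: contains yield, so CALLING it returns a generator.
    yield "thinking... "
    yield f"answer to {prompt!r}"


def handle_non_streaming(prompt):
    return f"answer to {prompt!r}"


def entrypoint(payload):
    """Regular function: RETURNS either a generator object or a string.

    The SDK inspects the returned value's type, not this function, so the
    yield keyword never appears in the entrypoint itself.
    """
    if payload.get("stream"):
        return handle_streaming(payload.get("prompt") or "")
    return handle_non_streaming(payload.get("prompt") or "")
```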
6. Testing Observations
Test Through the Non-Streaming Path
All 12 of my automated tests use non-streaming mode. Streaming responses are harder to assert against (you’d need to reconstruct the full text from chunks). Non-streaming gives you a clean, complete response string to validate.
Streaming is a transport concern. If your logic works non-streaming, it works streaming — the same code runs either way, just yielded differently.
Memory Persistence Is the Key Test
The most important test: send a math problem on session A, then ask about the result on a completely different session B. Session B has empty RAM (new microVM). If the agent remembers the answer, durable memory is working. If the response attributes the memory source as “durable,” the hybrid architecture is working correctly.
Isolation Tests Are Easy to Get Wrong
Testing that User 2 doesn’t see User 1’s history is tricky because LLMs can coincidentally mention the same numbers. Checking that the response doesn’t contain both “50” and “48” (from User 1’s specific calculation chain) is more robust than checking for either alone. Even so, these tests can be flaky.
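The conjunction check is tiny but worth isolating as a helper (the specific values come from my User 1 test fixture):

```python
def history_leaked(response_text: str) -> bool:
    """Flag a leak only if BOTH values from User 1's calculation chain
    appear together; either one alone could be an LLM coincidence."""
    return "50" in response_text and "48" in response_text
```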
Observed Latencies
| Operation | Cold Start | Warm |
|---|---|---|
| Ping (no LLM, pure microVM overhead) | ~1.7s | ~180ms |
| Chat (LLM + memory hooks) | ~8–17s | ~5.4s |
| Chat streaming (time to first token) | — | ~1–1.5s |
| Memory query (durable read, no LLM) | — | ~350ms |
The wide range on cold-start chat (8–17s) is due to both microVM provisioning and the first-time model/memory client initialization. Subsequent cold starts are faster (~8s) because ECR image layers are cached.
7. Observability Setup
AgentCore supports three observability pillars, all routed through the runtime role’s IAM permissions:
| Pillar | Where It Goes | What You See |
|---|---|---|
| Logs | CloudWatch Logs | Two streams: runtime-logs (your print output) and otel-rt-logs (OpenTelemetry) |
| Traces | X-Ray via OTel | End-to-end request traces with timing breakdown |
| Metrics | CloudWatch Metrics | Invocation count, latency, errors under the bedrock-agentcore namespace |
The runtime role needs permissions for all three: log group/stream creation and writes, X-Ray segment and telemetry submission, and CloudWatch metric publishing scoped to the bedrock-agentcore namespace.
The GenAI Observability Dashboard in the CloudWatch console provides a unified view across all your AgentCore agents — worth bookmarking.
8. Deployment Notes
- CodeBuild is the default and easiest path. It builds ARM64 containers in the cloud — no local Docker required. Build time is ~40 seconds.
- Use your venv Python. On macOS, the system Python is typically 3.9.x. The SDK needs 3.11+. I wasted time on cryptic import errors before realizing I was using the wrong interpreter.
- Memory resources are idempotent-ish. Creating a memory that already exists throws a ValidationException. You need to catch it and look up the existing one via list. Not a big deal, but not obvious from the API.
- The agent ARN lives in the YAML config (`.bedrock_agentcore.yaml`), not in your deploy info JSON. You need it for direct boto3 calls. The SDK handles this internally when you use its invoke method.
- Redeployment is seamless. Update your code, run deploy, the agent updates in place. Session IDs reset, but durable memory persists. Warm microVMs from the old version drain naturally.
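The create-or-lookup dance for memory resources can be wrapped once. Method and field names below are a sketch of the control-plane calls, not verified signatures; the real point is the try/except-then-list pattern:

```python
def ensure_memory(client, name):
    """Create a memory resource, or look up the existing one on conflict.

    create_memory / list_memories and their return shapes are assumptions
    for illustration only.
    """
    try:
        return client.create_memory(name=name)
    except Exception as err:
        # The real API raises a ValidationException when the memory exists.
        if "ValidationException" not in type(err).__name__ \
                and "ValidationException" not in str(err):
            raise
        for mem in client.list_memories():
            if mem.get("name") == name:
                return mem
        raise
```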
9. What I’d Do Differently Next Time
- Build non-streaming first, add streaming last. Streaming is purely a transport layer concern. Get your agent logic, memory, and tools working with simple request/response, then layer streaming on top. Debugging streaming issues while your core logic is also broken is miserable.
- Assume everything the SDK touches will be JSON-encoded. Return values, streaming yields, headers — if the SDK handles it, expect encoding. Design your data formats around this from day one.
- Keep non-streaming responses tiny. The 1024-byte truncation limit means you should design response formats to be compact from the start, not try to shrink them after the fact.
- Write the ping test first. A no-LLM ping mode that returns cache status and invocation count is invaluable for debugging microVM behavior. It isolates platform issues from agent logic issues. Every new agent I build will start with this.
- Use tags, not JSON, for non-streaming metadata. A simple delimited format like `<<META:key=val,key=val>>` survives the SDK's encoding gauntlet intact. JSON metadata in non-streaming responses is a losing battle.
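The ping mode described above might be as small as this (field names are illustrative):

```python
_invocations = 0
_cache_ready = False  # would flip to True once expensive init completes


def handle_ping():
    """No-LLM diagnostic: distinguishes cold from warm microVMs.

    invocation_count resets to 1 on every cold start, so any count > 1
    proves the request landed on a warm microVM with live globals.
    """
    global _invocations
    _invocations += 1
    return {
        "mode": "ping",
        "invocation_count": _invocations,
        "cache_ready": _cache_ready,
    }
```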