The Agent Loop Iceberg — 10 Hard Problems Hiding Beneath the Simple Loop
15th March 2026
The basic agent loop — LLM call, tool execution, observe result, repeat — is maybe 10% of a production agent’s code. The other 90% is making it reliable, resumable, extensible, and production-grade. After tracing through real agent source code, I found ten hard problems hiding beneath the surface that tutorials never show you.
The Happy Path Everyone Shows You
while True:
    response = llm.call(messages)
    if response.has_tool_call:
        result = execute_tool(response.tool_call)
        messages.append(result)
    else:
        return response.text
This works in demos. It breaks in production. Here’s what’s underneath.
1. Context Window Is Finite — What Happens When It Fills Up?
The basic loop assumes infinite memory. In reality:
Turn 1: User msg + Assistant response + Tool results = 2K tokens
Turn 5: All accumulated messages = 15K tokens
Turn 20: All accumulated messages = 80K tokens
Turn 35: BOOM — context overflow, API rejects the call
Production agents implement automatic compaction. When context approaches the limit:
1. Pick a "cut point" in message history
2. Send old messages to LLM: "Summarize what happened"
3. Replace everything before cut point with that summary
4. Track which files were read/modified (so the agent doesn't lose awareness)
The hidden complexity:
When do you compact?
Too early = lose important context
Too late = overflow error
Two triggers needed:
Soft threshold → proactive compaction (before it's urgent)
Hard overflow → reactive compaction with auto-retry (emergency)
This is the same context rot problem from autoresearch, but solved differently. Autoresearch avoids it by being stateless. Long-running interactive agents can’t be stateless — they must manage the window actively.
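The two-trigger compaction above can be sketched in a few lines. Everything here is a hypothetical stand-in for illustration: the threshold values, the `count_tokens` and `summarize` callables, and the message shape are assumptions, not any real agent's API.

```python
# Soft limit: compact proactively, before it's urgent.
# Hard limit would trigger reactive compaction + retry on overflow.
SOFT_LIMIT = 150_000   # tokens (assumed value)

def maybe_compact(messages, count_tokens, summarize, keep_last=10):
    """Replace everything before a cut point with an LLM-written summary.

    count_tokens(msg) -> int and summarize(msgs) -> str are injected so the
    sketch stays model-agnostic.
    """
    total = sum(count_tokens(m) for m in messages)
    if total < SOFT_LIMIT or len(messages) <= keep_last:
        return messages, False
    cut = len(messages) - keep_last          # pick a cut point in history
    summary = summarize(messages[:cut])      # "Summarize what happened"
    summary_msg = {"role": "user",
                   "content": f"[Summary of earlier turns]\n{summary}"}
    return [summary_msg] + messages[cut:], True
```

A real implementation would also track which files were read or modified and fold that into the summary, so the agent keeps its working-set awareness across the cut.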
2. Errors Don’t Mean “Stop” — They Mean “Wait and Retry”
Your mental model: LLM responds or fails. Reality:
API call → 429 Rate Limited (wait 30s, retry)
API call → 502 Bad Gateway (wait 2s, retry)
API call → 503 Overloaded (wait 4s, retry)
API call → Context overflow (compact, then retry)
API call → Success!
Production agents classify errors and handle each differently:
Retryable errors (429, 5xx, connection errors):
→ Exponential backoff: 1s → 2s → 4s → 8s → 16s
→ Up to N retries, then surface to user
Context overflow:
→ Don't retry blindly
→ Compact first, THEN retry
→ This is a different recovery path, not just "try again"
Client errors (400, auth failures):
→ Surface to user immediately, no retry
→ Retrying these wastes time and tokens
Without error classification, your agent dies on the first rate limit.
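A minimal sketch of that classification, with an assumed set of retryable status codes and hypothetical `APIError`/`ContextOverflowError` types standing in for whatever your SDK raises:

```python
import time

RETRYABLE = {429, 500, 502, 503, 529}   # assumed set of transient statuses

class ContextOverflowError(Exception):
    pass

class APIError(Exception):
    def __init__(self, status):
        super().__init__(f"API error {status}")
        self.status = status

def call_with_recovery(call, compact, max_retries=5, base_delay=1.0,
                       sleep=time.sleep):
    """Back off on transient errors, compact on context overflow,
    surface client errors immediately."""
    delay = base_delay
    for attempt in range(max_retries + 1):
        try:
            return call()
        except ContextOverflowError:
            compact()               # different recovery path: compact, THEN retry
        except APIError as e:
            if e.status not in RETRYABLE or attempt == max_retries:
                raise               # 400 / auth failures: no retry
            sleep(delay)            # exponential backoff: 1s, 2s, 4s, ...
            delay *= 2
    raise RuntimeError("retries exhausted")
```

Injecting `sleep` keeps the backoff testable without actually waiting.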
3. Users Don’t Wait — Steering and Queuing
Basic model: user sends message, waits for full response, sends next message. Reality: users want to interrupt or redirect mid-stream.
User: "Refactor the auth module"
Agent: [streaming... reading files... calling tools...]
User: "Actually, skip the tests, just do the main code" ← WHILE AGENT IS RUNNING
Production agents handle this with two queue types:
Steer:
Interrupt NOW, inject message into current turn
Agent sees the new instruction before its next tool call
Used for: corrections, redirections, "stop doing that"
Follow-up:
Wait until agent finishes, then automatically send
Agent completes current task, then starts the queued one
Used for: "after that, also do X"
This is invisible to the user but critical for interactive agents. Without it, you either block all input during processing (bad UX) or lose messages (worse UX).
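The two queue types might look like this. The class, method names, and drain points are assumptions for illustration; the key idea is simply that steer messages are drained mid-turn and follow-ups only when the turn ends.

```python
from collections import deque

class InputQueues:
    """Two queues for user input that arrives while the agent is busy."""

    def __init__(self):
        self.steer = deque()       # interrupt NOW: inject into the current turn
        self.follow_up = deque()   # wait: send after the agent finishes

    def queue(self, text, mode="steer"):
        (self.steer if mode == "steer" else self.follow_up).append(text)

    def drain_steer(self):
        """Called before each tool call, so corrections land mid-turn."""
        msgs, self.steer = list(self.steer), deque()
        return msgs

    def next_follow_up(self):
        """Called when a turn ends; returns the next queued task, if any."""
        return self.follow_up.popleft() if self.follow_up else None
```

The agent loop checks `drain_steer()` between tool calls and `next_follow_up()` at turn boundaries; the user never sees either queue.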
4. The System Prompt Is Dynamic, Not Static
Basic model: one fixed system prompt. Reality:
system_prompt = base_instructions
system_prompt += tool_descriptions # Changes if tools added/removed
system_prompt += tool_guidelines # Per-tool usage hints
system_prompt += project_context # CLAUDE.md files from cwd
system_prompt += skills_available # Dynamically discovered
system_prompt += extension_injections # Plugins modify it
system_prompt += f"Current date: {now}"
system_prompt += f"CWD: {cwd}"
The system prompt is rebuilt before every LLM invocation. Extensions can modify it via hooks. This means the agent’s behavior changes based on what project you’re in, what extensions are loaded, and what tools are registered — all without changing the core agent code.
5. Tool Results Need Processing, Not Just Passing Through
Basic model: tool returns string, send to LLM. Reality: tool output is messy, dangerous, and unbounded.
Bash output problems:
Binary garbage (reading a .png with cat) → must sanitize
ANSI escape codes (colors, cursor) → must strip
Output too large (10MB log file) → must truncate
Output still streaming (long command) → must stream to UI AND collect for LLM
Processing pipeline:
Raw output
→ strip ANSI escape codes
→ detect and remove binary content
→ if > 64KB: write to temp file, truncate for LLM, include path to full output
→ stream chunks to UI in real-time
→ on completion: return truncated result + exit code + truncation flag
File read problems:
File too large → truncate with "[truncated]" indicator
Image file → resize and encode as base64 for multimodal LLMs
Binary file → reject gracefully with descriptive error
Without this pipeline, one cat /dev/urandom crashes your agent or burns your entire context window on garbage.
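A rough sketch of the sanitization steps for bash output; the 64KB limit comes from the pipeline above, while the binary-detection heuristic and the function shape are assumptions:

```python
import re

ANSI_RE = re.compile(r"\x1b\[[0-9;]*[A-Za-z]")   # CSI sequences: colors, cursor moves
MAX_BYTES = 64 * 1024                             # truncation limit from the text

def sanitize_tool_output(raw: bytes):
    """Return (text, reduced) where reduced means content was dropped.

    Order matters: detect binary first, then strip ANSI, then truncate.
    """
    if b"\x00" in raw[:8192]:                     # crude binary sniff (assumption)
        return "[binary content omitted]", True
    text = ANSI_RE.sub("", raw.decode("utf-8", errors="replace"))
    if len(text.encode()) > MAX_BYTES:
        return text[:MAX_BYTES] + "\n[truncated]", True
    return text, False
```

A production version would also write the full output to a temp file and hand the path to the LLM, as the pipeline above describes.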
6. Persistence — Sessions Are Not Just Chat History
Basic model: conversation lives in memory, gone when process dies. Production agents persist everything to disk:
Every message appended to JSONL with tree structure:
{"type":"message","id":"m1","parentId":null,"message":{...}}
{"type":"message","id":"m2","parentId":"m1","message":{...}}
{"type":"compaction","id":"c1","summary":"...","firstKeptEntryId":"m2"}
Why a tree structure instead of a flat list? Because of branching:
m1 → m2 → m3 → m4 (original conversation)
↘ m5 → m6 (user went back and tried different approach)
You can fork a conversation at any point and explore alternatives. The JSONL log is append-only — nothing is ever deleted, just new branches created. Compaction summaries are stored inline so you can resume a session that was compacted weeks ago.
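The append-only tree can be sketched in a few lines. `SessionLog` and its methods are hypothetical, though the record shape mirrors the JSONL above:

```python
import json

class SessionLog:
    """Append-only JSONL log with parent pointers; forking never deletes."""

    def __init__(self, path):
        self.path = path
        self.head = None                 # id of the latest entry on this branch

    def append(self, entry_id, message):
        record = {"type": "message", "id": entry_id,
                  "parentId": self.head, "message": message}
        with open(self.path, "a") as f:
            f.write(json.dumps(record) + "\n")
        self.head = entry_id

    def branch_from(self, entry_id):
        """Forking is just moving head; old entries stay in the file."""
        self.head = entry_id
```

Replaying a branch means following `parentId` pointers back from any head; compaction records would be replayed the same way.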
7. The Extension/Hook System — Every Event Is Interceptable
Basic model: monolithic loop. Production agents expose 20+ hook points where external code can intervene:
Hook Point What It Does
─────────────────────────────────────────────────────────
input Transform/block user input before LLM sees it
before_agent_start Inject messages, modify system prompt
tool_execution_start Approve/deny tool calls (permission system!)
tool_execution_end Transform tool results
message_end React to LLM output
agent_end Post-processing
session_before_compact Custom compaction strategy
This is how you build entire subsystems without modifying core agent code:
Permission systems → hook into tool_execution_start, ask user before running bash
Logging/telemetry → hook into every event, record tool calls and latency
Custom tools → register new tools at runtime via before_agent_start
Guardrails → hook into input, block dangerous prompts
Skills/plugins → inject capabilities via extension hooks
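A minimal hook bus illustrating the intercept-or-veto pattern; the event names follow the table above, everything else is assumed:

```python
class HookBus:
    """Registry of hook callbacks; each may transform or veto a payload."""

    def __init__(self):
        self.hooks = {}

    def on(self, event, fn):
        self.hooks.setdefault(event, []).append(fn)

    def emit(self, event, payload):
        """Run hooks in registration order. Returning None blocks the event
        (e.g. a permission hook denying a tool call)."""
        for fn in self.hooks.get(event, []):
            payload = fn(payload)
            if payload is None:
                return None
        return payload
```

A permission system is then just a hook: register a callback on `tool_execution_start` that asks the user and returns `None` to deny.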
8. Event Queue Serialization — Race Conditions Are Real
Basic model: process events as they come. Reality: events arrive asynchronously from the streaming API and must be processed in order.
// WRONG — race condition
agent.on("event", async (e) => {
  await saveToFile(e)  // What if two events fire before first save completes?
  await updateUI(e)    // Events processed out of order → corrupted session
})

// RIGHT — chain promises
handleEvent(event) {
  this.eventQueue = this.eventQueue.then(() => processEvent(event))
}
// Each event waits for the previous one to complete.
// Order is guaranteed. No corruption. No lost messages.
Without event serialization, you get corrupted session files, UI glitches, and lost messages. This is a classic concurrency bug that’s invisible in demos (where events are slow) and catastrophic in production (where events arrive in bursts).
9. Abort Is Harder Than You Think
Basic model: cancel = stop. Reality: you need to cancel many things simultaneously:
Agent running → user hits Ctrl+C
Must cancel ALL of these:
→ Abort LLM streaming (cancel HTTP request mid-stream)
→ Kill bash subprocess (and its ENTIRE process tree — it may have spawned children)
→ Cancel compaction (if running in background)
→ Cancel retry timer (if waiting for backoff)
→ Cancel branch summary (if generating)
→ Clean up temp files (partial writes)
→ Leave session in consistent state (so it can be resumed)
Production agents maintain 5+ separate AbortControllers for different cancellable operations.
Killing a bash process is especially tricky — the command may have spawned child processes. You need to kill the entire process tree, not just the parent. And after aborting everything, the session file must be in a state that allows resumption.
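One way to sketch coordinated cancellation (in Python rather than AbortControllers, with all names assumed): register a teardown for every cancellable resource, then run them all on abort. `start_new_session=True` gives the subprocess its own process group so the whole tree can be killed at once; the `killpg` path is POSIX-only.

```python
import os
import signal
import subprocess

class AbortScope:
    """Collects teardown callbacks for every cancellable resource."""

    def __init__(self):
        self.cleanups = []

    def register(self, cleanup):
        self.cleanups.append(cleanup)

    def spawn(self, cmd):
        # Own process group, so killpg takes out the ENTIRE tree,
        # including any children the command spawned.
        proc = subprocess.Popen(cmd, start_new_session=True)
        self.register(lambda: os.killpg(os.getpgid(proc.pid), signal.SIGKILL))
        return proc

    def abort(self):
        for cleanup in reversed(self.cleanups):   # LIFO, like defer
            try:
                cleanup()
            except Exception:
                pass                              # best-effort teardown
        self.cleanups.clear()
```

Streaming requests, retry timers, and compaction tasks each register their own cleanup the same way; Ctrl+C becomes a single `abort()` call.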
10. Model Awareness — Not All LLMs Are Equal
Production agents don’t hardcode model assumptions. They maintain a model registry:
{
  contextWindow: 200000,    // How much can fit?
  reasoning: true,          // Supports thinking/reasoning?
  thinkingLevel: "medium",  // How deep to think?
  provider: "anthropic",    // Different API formats!
}
What changes per model:
Compaction thresholds → compact earlier for smaller context windows
Thinking configuration → enable/disable reasoning mode
API format → Anthropic vs OpenAI vs Bedrock message formats
Token counting → different tokenizers, different counts
Feature support → not all models support images, tools, or streaming
Users can hot-swap models mid-conversation. The agent adjusts its behavior — compaction strategy, thinking levels, API calls — based on which model is active. Without this, switching models mid-session either crashes or silently degrades.
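A sketch of such a registry: the field names mirror the snippet above, while the model entries and the 75% compaction ratio are made up for illustration.

```python
from dataclasses import dataclass

@dataclass
class ModelInfo:
    context_window: int    # how much can fit
    reasoning: bool        # supports thinking/reasoning
    thinking_level: str    # how deep to think
    provider: str          # different API formats

REGISTRY = {
    "big-model":   ModelInfo(200_000, True,  "medium", "anthropic"),
    "small-model": ModelInfo(32_000,  False, "none",   "openai"),
}

def compaction_threshold(model_id, ratio=0.75):
    """Compact earlier for smaller context windows (ratio is an assumption)."""
    return int(REGISTRY[model_id].context_window * ratio)
```

On a hot-swap, the agent re-reads the registry entry and re-derives its thresholds and API adapter before the next call; nothing about the conversation itself needs to change.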
The Iceberg
What you see:
LLM → Tool → Result → Loop
──────────────────────────────────────────────
What's underneath:
Context compaction with soft/hard thresholds
Error classification with exponential backoff
Message queuing and mid-stream steering
Dynamic system prompt assembly
Tool output sanitization and truncation
Persistent branching session trees (JSONL)
20+ extension hooks at every stage
Serial event queue (no race conditions)
Multi-resource abort coordination
Model-aware behavior adaptation
The basic loop is 50 lines of code. A production agent is 50,000+ lines. The gap is entirely in reliability, resumability, extensibility, and the thousand edge cases that tutorials skip.
Why This Matters for Agent Builders
If you’re building agents, you have three choices:
1. Use a framework (Strands, LangGraph, CrewAI)
→ Gets you maybe 60% of these problems solved
→ You still own context management, persistence, and error handling
2. Use a managed runtime (AgentCore, Bedrock Agents)
→ Gets you infrastructure + some session management
→ You still own the agent loop and tool integration
3. Build from scratch
→ You own all 10 problems
→ Full control, full responsibility
→ This is what Claude Code, Cursor, and Windsurf did
Most teams underestimate option 3 by 10x. The loop is easy. Everything else is the work.