The Agent Loop Iceberg — 10 Hard Problems Hiding Beneath the Simple Loop
15th March 2026
The basic agent loop — LLM call, tool execution, observe result, repeat — is maybe 10% of a production agent’s code. The other 90% is making it reliable, resumable, extensible, and production-grade. After tracing through real agent source code, I found ten hard problems hiding beneath the surface that tutorials never show you.
The Happy Path Everyone Shows You
while True:
    response = llm.call(messages)
    if response.has_tool_call:
        result = execute_tool(response.tool_call)
        messages.append(result)
    else:
        return response.text
This works in demos. It breaks in production. Here’s what’s underneath.
1. Context Window Is Finite — What Happens When It Fills Up?
The basic loop assumes infinite memory. In reality:
Turn 1: User msg + Assistant response + Tool results = 2K tokens
Turn 5: All accumulated messages = 15K tokens
Turn 20: All accumulated messages = 80K tokens
Turn 35: BOOM — context overflow, API rejects the call
Production agents implement automatic compaction. When context approaches the limit:
1. Pick a "cut point" in message history
2. Send old messages to LLM: "Summarize what happened"
3. Replace everything before cut point with that summary
4. Track which files were read/modified (so the agent doesn't lose awareness)
The hidden complexity:
When do you compact?
Too early = lose important context
Too late = overflow error
Two triggers needed:
Soft threshold → proactive compaction (before it's urgent)
Hard overflow → reactive compaction with auto-retry (emergency)
This is the same context rot problem from autoresearch, but solved differently. Autoresearch avoids it by being stateless. Long-running interactive agents can’t be stateless — they must manage the window actively.
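The two-trigger compaction above can be sketched in a few lines. Everything here is a hypothetical stand-in for illustration: the threshold values, the `count_tokens` and `summarize` callables, and the message shape are assumptions, not any real agent's API.

```python
# Soft limit: compact proactively, before it's urgent.
# Hard limit would trigger reactive compaction + retry on overflow.
SOFT_LIMIT = 150_000   # tokens (assumed value)

def maybe_compact(messages, count_tokens, summarize, keep_last=10):
    """Replace everything before a cut point with an LLM-written summary.

    count_tokens(msg) -> int and summarize(msgs) -> str are injected so the
    sketch stays model-agnostic.
    """
    total = sum(count_tokens(m) for m in messages)
    if total < SOFT_LIMIT or len(messages) <= keep_last:
        return messages, False
    cut = len(messages) - keep_last          # pick a cut point in history
    summary = summarize(messages[:cut])      # "Summarize what happened"
    summary_msg = {"role": "user",
                   "content": f"[Summary of earlier turns]\n{summary}"}
    return [summary_msg] + messages[cut:], True
```

A real implementation would also track which files were read or modified and fold that into the summary, so the agent keeps its working-set awareness across the cut.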
2. Errors Don’t Mean “Stop” — They Mean “Wait and Retry”
Your mental model: LLM responds or fails. Reality:
API call → 429 Rate Limited (wait 30s, retry)
API call → 502 Bad Gateway (wait 2s, retry)
API call → 503 Overloaded (wait 4s, retry)
API call → Context overflow (compact, then retry)
API call → Success!
Production agents classify errors and handle each differently:
Retryable errors (429, 5xx, connection errors):
→ Exponential backoff: 1s → 2s → 4s → 8s → 16s
→ Up to N retries, then surface to user
Context overflow:
→ Don't retry blindly
→ Compact first, THEN retry
→ This is a different recovery path, not just "try again"
Client errors (400, auth failures):
→ Surface to user immediately, no retry
→ Retrying these wastes time and tokens
Without error classification, your agent dies on the first rate limit.
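A minimal sketch of that classification, with an assumed set of retryable status codes and hypothetical `APIError`/`ContextOverflowError` types standing in for whatever your SDK raises:

```python
import time

RETRYABLE = {429, 500, 502, 503, 529}   # assumed set of transient statuses

class ContextOverflowError(Exception):
    pass

class APIError(Exception):
    def __init__(self, status):
        super().__init__(f"API error {status}")
        self.status = status

def call_with_recovery(call, compact, max_retries=5, base_delay=1.0,
                       sleep=time.sleep):
    """Back off on transient errors, compact on context overflow,
    surface client errors immediately."""
    delay = base_delay
    for attempt in range(max_retries + 1):
        try:
            return call()
        except ContextOverflowError:
            compact()               # different recovery path: compact, THEN retry
        except APIError as e:
            if e.status not in RETRYABLE or attempt == max_retries:
                raise               # 400 / auth failures: no retry
            sleep(delay)            # exponential backoff: 1s, 2s, 4s, ...
            delay *= 2
    raise RuntimeError("retries exhausted")
```

Injecting `sleep` keeps the backoff testable without actually waiting.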
3. Users Don’t Wait — Steering and Queuing
Basic model: user sends message, waits for full response, sends next message. Reality: users want to interrupt or redirect mid-stream.
User: "Refactor the auth module"
Agent: [streaming... reading files... calling tools...]
User: "Actually, skip the tests, just do the main code" ← WHILE AGENT IS RUNNING
Production agents handle this with two queue types:
Steer:
Interrupt NOW, inject message into current turn
Agent sees the new instruction before its next tool call
Used for: corrections, redirections, "stop doing that"
Follow-up:
Wait until agent finishes, then automatically send
Agent completes current task, then starts the queued one
Used for: "after that, also do X"
This is invisible to the user but critical for interactive agents. Without it, you either block all input during processing (bad UX) or lose messages (worse UX).
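The two queue types might look like this. The class, method names, and drain points are assumptions for illustration; the key idea is simply that steer messages are drained mid-turn and follow-ups only when the turn ends.

```python
from collections import deque

class InputQueues:
    """Two queues for user input that arrives while the agent is busy."""

    def __init__(self):
        self.steer = deque()       # interrupt NOW: inject into the current turn
        self.follow_up = deque()   # wait: send after the agent finishes

    def queue(self, text, mode="steer"):
        (self.steer if mode == "steer" else self.follow_up).append(text)

    def drain_steer(self):
        """Called before each tool call, so corrections land mid-turn."""
        msgs, self.steer = list(self.steer), deque()
        return msgs

    def next_follow_up(self):
        """Called when a turn ends; returns the next queued task, if any."""
        return self.follow_up.popleft() if self.follow_up else None
```

The agent loop checks `drain_steer()` between tool calls and `next_follow_up()` at turn boundaries; the user never sees either queue.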
4. The System Prompt Is Dynamic, Not Static
Basic model: one fixed system prompt. Reality:
system_prompt = base_instructions
system_prompt += tool_descriptions # Changes if tools added/removed
system_prompt += tool_guidelines # Per-tool usage hints
system_prompt += project_context # CLAUDE.md files from cwd
system_prompt += skills_available # Dynamically discovered
system_prompt += extension_injections # Plugins modify it
system_prompt += f"Current date: {now}"
system_prompt += f"CWD: {cwd}"
The system prompt is rebuilt before every LLM invocation. Extensions can modify it via hooks. This means the agent’s behavior changes based on what project you’re in, what extensions are loaded, and what tools are registered — all without changing the core agent code.
5. Tool Results Need Processing, Not Just Passing Through
Basic model: tool returns string, send to LLM. Reality: tool output is messy, dangerous, and unbounded.
Bash output problems:
Binary garbage (reading a .png with cat) → must sanitize
ANSI escape codes (colors, cursor) → must strip
Output too large (10MB log file) → must truncate
Output still streaming (long command) → must stream to UI AND collect for LLM
Processing pipeline:
Raw output
→ strip ANSI escape codes
→ detect and remove binary content
→ if > 64KB: write to temp file, truncate for LLM, include path to full output
→ stream chunks to UI in real-time
→ on completion: return truncated result + exit code + truncation flag
File read problems:
File too large → truncate with "[truncated]" indicator
Image file → resize and encode as base64 for multimodal LLMs
Binary file → reject gracefully with descriptive error
Without this pipeline, one cat /dev/urandom crashes your agent or burns your entire context window on garbage.
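A rough sketch of the sanitization steps for bash output; the 64KB limit comes from the pipeline above, while the binary-detection heuristic and the function shape are assumptions:

```python
import re

ANSI_RE = re.compile(r"\x1b\[[0-9;]*[A-Za-z]")   # CSI sequences: colors, cursor moves
MAX_BYTES = 64 * 1024                             # truncation limit from the text

def sanitize_tool_output(raw: bytes):
    """Return (text, reduced) where reduced means content was dropped.

    Order matters: detect binary first, then strip ANSI, then truncate.
    """
    if b"\x00" in raw[:8192]:                     # crude binary sniff (assumption)
        return "[binary content omitted]", True
    text = ANSI_RE.sub("", raw.decode("utf-8", errors="replace"))
    if len(text.encode()) > MAX_BYTES:
        return text[:MAX_BYTES] + "\n[truncated]", True
    return text, False
```

A production version would also write the full output to a temp file and hand the path to the LLM, as the pipeline above describes.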
6. Persistence — Sessions Are Not Just Chat History
Basic model: conversation lives in memory, gone when process dies. Production agents persist everything to disk:
Every message appended to JSONL with tree structure:
{"type":"message","id":"m1","parentId":null,"message":{...}}
{"type":"message","id":"m2","parentId":"m1","message":{...}}
{"type":"compaction","id":"c1","summary":"...","firstKeptEntryId":"m2"}
Why a tree structure instead of a flat list? Because of branching:
m1 → m2 → m3 → m4 (original conversation)
↘ m5 → m6 (user went back and tried different approach)
You can fork a conversation at any point and explore alternatives. The JSONL log is append-only — nothing is ever deleted, just new branches created. Compaction summaries are stored inline so you can resume a session that was compacted weeks ago.
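The append-only tree can be sketched in a few lines. `SessionLog` and its methods are hypothetical, though the record shape mirrors the JSONL above:

```python
import json

class SessionLog:
    """Append-only JSONL log with parent pointers; forking never deletes."""

    def __init__(self, path):
        self.path = path
        self.head = None                 # id of the latest entry on this branch

    def append(self, entry_id, message):
        record = {"type": "message", "id": entry_id,
                  "parentId": self.head, "message": message}
        with open(self.path, "a") as f:
            f.write(json.dumps(record) + "\n")
        self.head = entry_id

    def branch_from(self, entry_id):
        """Forking is just moving head; old entries stay in the file."""
        self.head = entry_id
```

Replaying a branch means following `parentId` pointers back from any head; compaction records would be replayed the same way.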
7. The Extension/Hook System — Every Event Is Interceptable
Basic model: monolithic loop. Production agents expose 20+ hook points where external code can intervene:
Hook Point What It Does
─────────────────────────────────────────────────────────
input Transform/block user input before LLM sees it
before_agent_start Inject messages, modify system prompt
tool_execution_start Approve/deny tool calls (permission system!)
tool_execution_end Transform tool results
message_end React to LLM output
agent_end Post-processing
session_before_compact Custom compaction strategy
This is how you build entire subsystems without modifying core agent code:
Permission systems → hook into tool_execution_start, ask user before running bash
Logging/telemetry → hook into every event, record tool calls and latency
Custom tools → register new tools at runtime via before_agent_start
Guardrails → hook into input, block dangerous prompts
Skills/plugins → inject capabilities via extension hooks
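A minimal hook bus illustrating the intercept-or-veto pattern; the event names follow the table above, everything else is assumed:

```python
class HookBus:
    """Registry of hook callbacks; each may transform or veto a payload."""

    def __init__(self):
        self.hooks = {}

    def on(self, event, fn):
        self.hooks.setdefault(event, []).append(fn)

    def emit(self, event, payload):
        """Run hooks in registration order. Returning None blocks the event
        (e.g. a permission hook denying a tool call)."""
        for fn in self.hooks.get(event, []):
            payload = fn(payload)
            if payload is None:
                return None
        return payload
```

A permission system is then just a hook: register a callback on `tool_execution_start` that asks the user and returns `None` to deny.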
8. Event Queue Serialization — Race Conditions Are Real
Basic model: process events as they come. Reality: events arrive asynchronously from the streaming API and must be processed in order.
// WRONG — race condition
agent.on("event", async (e) => {
  await saveToFile(e)  // What if two events fire before first save completes?
  await updateUI(e)    // Events processed out of order → corrupted session
})

// RIGHT — chain promises
handleEvent(event) {
  this.eventQueue = this.eventQueue.then(() => processEvent(event))
}
// Each event waits for the previous one to complete.
// Order is guaranteed. No corruption. No lost messages.
Without event serialization, you get corrupted session files, UI glitches, and lost messages. This is a classic concurrency bug that’s invisible in demos (where events are slow) and catastrophic in production (where events arrive in bursts).
9. Abort Is Harder Than You Think
Basic model: cancel = stop. Reality: you need to cancel many things simultaneously:
Agent running → user hits Ctrl+C
Must cancel ALL of these:
→ Abort LLM streaming (cancel HTTP request mid-stream)
→ Kill bash subprocess (and its ENTIRE process tree — it may have spawned children)
→ Cancel compaction (if running in background)
→ Cancel retry timer (if waiting for backoff)
→ Cancel branch summary (if generating)
→ Clean up temp files (partial writes)
→ Leave session in consistent state (so it can be resumed)
Production agents maintain 5+ separate AbortControllers for different cancellable operations.
Killing a bash process is especially tricky — the command may have spawned child processes. You need to kill the entire process tree, not just the parent. And after aborting everything, the session file must be in a state that allows resumption.
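One way to sketch coordinated cancellation (in Python rather than AbortControllers, with all names assumed): register a teardown for every cancellable resource, then run them all on abort. `start_new_session=True` gives the subprocess its own process group so the whole tree can be killed at once; the `killpg` path is POSIX-only.

```python
import os
import signal
import subprocess

class AbortScope:
    """Collects teardown callbacks for every cancellable resource."""

    def __init__(self):
        self.cleanups = []

    def register(self, cleanup):
        self.cleanups.append(cleanup)

    def spawn(self, cmd):
        # Own process group, so killpg takes out the ENTIRE tree,
        # including any children the command spawned.
        proc = subprocess.Popen(cmd, start_new_session=True)
        self.register(lambda: os.killpg(os.getpgid(proc.pid), signal.SIGKILL))
        return proc

    def abort(self):
        for cleanup in reversed(self.cleanups):   # LIFO, like defer
            try:
                cleanup()
            except Exception:
                pass                              # best-effort teardown
        self.cleanups.clear()
```

Streaming requests, retry timers, and compaction tasks each register their own cleanup the same way; Ctrl+C becomes a single `abort()` call.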
10. Model Awareness — Not All LLMs Are Equal
Production agents don’t hardcode model assumptions. They maintain a model registry:
{
  contextWindow: 200000,    // How much can fit?
  reasoning: true,          // Supports thinking/reasoning?
  thinkingLevel: "medium",  // How deep to think?
  provider: "anthropic",    // Different API formats!
}
What changes per model:
Compaction thresholds → compact earlier for smaller context windows
Thinking configuration → enable/disable reasoning mode
API format → Anthropic vs OpenAI vs Bedrock message formats
Token counting → different tokenizers, different counts
Feature support → not all models support images, tools, or streaming
Users can hot-swap models mid-conversation. The agent adjusts its behavior — compaction strategy, thinking levels, API calls — based on which model is active. Without this, switching models mid-session either crashes or silently degrades.
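A sketch of such a registry: the field names mirror the snippet above, while the model entries and the 75% compaction ratio are made up for illustration.

```python
from dataclasses import dataclass

@dataclass
class ModelInfo:
    context_window: int    # how much can fit
    reasoning: bool        # supports thinking/reasoning
    thinking_level: str    # how deep to think
    provider: str          # different API formats

REGISTRY = {
    "big-model":   ModelInfo(200_000, True,  "medium", "anthropic"),
    "small-model": ModelInfo(32_000,  False, "none",   "openai"),
}

def compaction_threshold(model_id, ratio=0.75):
    """Compact earlier for smaller context windows (ratio is an assumption)."""
    return int(REGISTRY[model_id].context_window * ratio)
```

On a hot-swap, the agent re-reads the registry entry and re-derives its thresholds and API adapter before the next call; nothing about the conversation itself needs to change.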
The Iceberg
What you see:
LLM → Tool → Result → Loop
──────────────────────────────────────────────
What's underneath:
Context compaction with soft/hard thresholds
Error classification with exponential backoff
Message queuing and mid-stream steering
Dynamic system prompt assembly
Tool output sanitization and truncation
Persistent branching session trees (JSONL)
20+ extension hooks at every stage
Serial event queue (no race conditions)
Multi-resource abort coordination
Model-aware behavior adaptation
The basic loop is 50 lines of code. A production agent is 50,000+ lines. The gap is entirely in reliability, resumability, extensibility, and the thousand edge cases that tutorials skip.
Why This Matters for Agent Builders
If you’re building agents, you have three choices:
1. Use a framework (Strands, LangGraph, CrewAI)
→ Gets you maybe 60% of these problems solved
→ You still own context management, persistence, and error handling
2. Use a managed runtime (AgentCore, Bedrock Agents)
→ Gets you infrastructure + some session management
→ You still own the agent loop and tool integration
3. Build from scratch
→ You own all 10 problems
→ Full control, full responsibility
→ This is what Claude Code, Cursor, and Windsurf did
Most teams underestimate option 3 by 10x. The loop is easy. Everything else is the work.