Context Engineering for AI Agents: 6 Techniques from Claude Code, Manus, and Devin
9th March 2026
After studying how production AI agents like Claude Code, Manus, and Devin actually work under the hood, the single most important concept isn’t prompt engineering — it’s context engineering. The art of controlling exactly what goes into the model’s context window, and what stays out.
This post breaks down the problem, the failure modes, and the six techniques that production agents use to manage context effectively.
An Agent = Context + Model + Prompt + Tools
Everything an LLM can reason about lives in its context window. Think of it as the model’s working memory:
┌─── CONTEXT WINDOW (the LLM's RAM) ────────────────────────────────────┐
│ │
│ ┌─────────────┐ System prompt, rules, persona │
│ │ PROMPT │ "You are an agent that..." │
│ └─────────────┘ │
│ ┌─────────────┐ Function definitions, MCP servers │
│ │ TOOLS │ search(), read_file(), run_tests() │
│ └─────────────┘ │
│ ┌─────────────┐ RAG results, file contents, API data │
│ │ KNOWLEDGE │ Retrieved docs, embeddings, facts │
│ └─────────────┘ │
│ ┌─────────────┐ Prior turns, tool results, errors │
│ │ MSG HISTORY │ user → assistant → tool → assistant → ... │
│ └─────────────┘ │
│ ┌─────────────┐ Few-shot examples, memories │
│ │ EXAMPLES │ Demonstrations of desired behavior │
│ └─────────────┘ │
│ │
│ ⚠ ALL OF THIS COMPETES FOR LIMITED ATTENTION │
└───────────────────────────────────────────────────────────────────────┘
│
▼
┌───────────────┐
│ LLM Model │ Attends over everything above
└───────┬───────┘
│
▼
Next action or response
The core problem: every token in the window competes for the model’s attention. With n tokens there are n² pairwise attention relationships to maintain. More tokens do not mean better results — often they mean worse ones.
- A Gemini agent degraded after ~100K tokens
- Llama 3.1 405B degrades after ~32K tokens
- Smaller models fail much earlier
- Context has diminishing returns, just like human working memory
The Four Context Failure Modes
When agents fail, it’s almost always one of these four problems:
1. Context Poisoning
Agent writes incorrect info into its own context, then treats it as ground truth — chasing impossible goals.
Example: Gemini playing Pokémon hallucinated a game state, wrote it to its “goals” section, then spent hours chasing an impossible objective.
2. Context Distraction
Context so long the model over-focuses on history, ignoring its training. Starts repeating past actions instead of planning new ones.
Example: Gemini agent past 100K tokens just repeated previous actions from history instead of synthesizing new strategies.
3. Context Confusion
Irrelevant information (extra tools, docs) causes the model to call wrong tools or use wrong data.
Example: Llama 8B given 46 tools = FAILS. Llama 8B given 19 tools = SUCCEEDS. Same task, same model. Less is more.
4. Context Clash
Early wrong answers stay in context and conflict with later correct information.
Example: o3 scored 98.1% when given all the information in a single prompt but only 64.1% when the same information was spread across multiple turns, because early wrong guesses polluted the context. (Across the models in the underlying study, the average drop was 39%.)
The 6 Techniques to Fix Context
Technique 1: RAG (Retrieval-Augmented Generation)
Category: SELECT context Fixes: Confusion, Distraction
Instead of stuffing everything in, retrieve only what’s relevant:
WITHOUT RAG: WITH RAG:
┌──────────────────┐ ┌──────────────────┐
│ 500 pages of docs│ │ Query: "auth bug" │
│ all jammed in │ │ │ │
│ context window │ │ ▼ │
│ (confused model) │ │ ┌──────────────┐ │
└──────────────────┘ │ │ Vector search│ │
│ └──────┬───────┘ │
│ ▼ │
│ Top 3 relevant │
│ chunks only │
│ (focused model) │
└──────────────────┘
Real example — Claude Code uses a hybrid model:
- CLAUDE.md — always pre-loaded (procedural memory)
- glob/grep — agent searches the codebase on demand
- file read — loads specific files when needed
Instead of indexing the entire codebase into context, it navigates like a human: browse → find → read.
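The select-then-load idea can be sketched in a few lines. This is a minimal, dependency-free illustration that scores documents by word overlap with the query; a production system would use embeddings and a vector index instead, and the `docs` corpus here is invented for the example.

```python
def retrieve(query: str, docs: dict[str, str], k: int = 3) -> list[str]:
    """Score each doc by word overlap with the query; return top-k names.

    Word overlap stands in for vector search so the sketch stays
    dependency-free -- swap in embeddings + an index for real use.
    """
    q = set(query.lower().split())
    scored = sorted(
        docs,
        key=lambda name: len(q & set(docs[name].lower().split())),
        reverse=True,
    )
    return scored[:k]

# Hypothetical corpus for illustration
docs = {
    "auth.md": "token refresh bug in auth login flow",
    "billing.md": "invoice generation and payment retries",
    "deploy.md": "kubernetes deploy pipeline",
}
print(retrieve("auth bug", docs, k=1))  # ['auth.md']
```

Only the winning chunks enter the context window; the other 500 pages stay on disk.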
Technique 2: Tool Loadout (Dynamic Tool Selection)
Category: SELECT context Fixes: Confusion
Don’t give all tools at once. Select the right “loadout” for the current task:
ALL TOOLS IN CONTEXT TOOL LOADOUT
┌──────────────────────────┐ ┌──────────────────────────┐
│ search_web() │ │ │
│ read_file() │ │ User query: "find bugs │
│ write_file() │ │ in auth module" │
│ send_email() │ │ ▼ │
│ create_calendar() │ │ Tool RAG / Recommender │
│ translate_text() │ │ │ │
│ resize_image() │ │ ▼ │
│ parse_pdf() │ │ Selected tools: │
│ ... 40 more tools ... │ │ • read_file() │
│ │ │ • grep_search() │
│ Model: "uhh which one?" │ │ • run_tests() │
└──────────────────────────┘ │ │
│ Model: "got it, clear!" │
└──────────────────────────┘
Real example — Manus masks tools instead of removing them. Tool definitions stay in context (preserving KV-cache) but logits are masked so the agent can’t select irrelevant tools. During a browser phase, browser_click() is allowed but shell_run() is masked.
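A rough sketch of the loadout idea, assuming a phase-based allow-list (the phase names and `PHASE_LOADOUT` mapping are invented for illustration). Manus does this at the logit level; filtering the selectable set per phase while keeping every definition in context mimics that effect at the application level.

```python
# All tool definitions stay in context (preserving the KV-cache);
# only the *allowed* set changes per phase, mimicking logit masking.
ALL_TOOLS = ["browser_click", "browser_read", "shell_run", "read_file"]

PHASE_LOADOUT = {  # hypothetical phase -> allowed-tool mapping
    "browsing": {"browser_click", "browser_read"},
    "coding": {"shell_run", "read_file"},
}

def allowed_tools(phase: str) -> list[str]:
    allow = PHASE_LOADOUT[phase]
    # Iterate ALL_TOOLS so the order is stable and the serialized
    # context doesn't churn between steps.
    return [t for t in ALL_TOOLS if t in allow]

print(allowed_tools("browsing"))  # ['browser_click', 'browser_read']
```

During a browsing phase the shell tools simply cannot be selected, which is the point: the model never has to reason about irrelevant options.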
Technique 3: Context Quarantine (Multi-Agent Isolation)
Category: ISOLATE context Fixes: Poisoning, Distraction
Split work across sub-agents, each with its own clean context:
┌──────────────────────────────────┐
│ LEAD AGENT │
│ (high-level plan + synthesis) │
│ Context: plan + summaries only │
└──────────┬───────────────────────┘
│ │ │
┌───────▼──┐ ┌────▼─────┐ ┌▼──────────┐
│ SubAgent │ │ SubAgent │ │ SubAgent │
│ "Search" │ │ "Analyze"│ │ "Code" │
│ │ │ │ │ │
│ Own tools│ │ Own tools│ │ Own tools │
│ Own ctx │ │ Own ctx │ │ Own ctx │
└────┬─────┘ └────┬─────┘ └─────┬─────┘
│ │ │
▼ ▼ ▼
1-2K token 1-2K token 1-2K token
summary summary summary
│ │ │
└─────────────┼─────────────┘
▼
Lead agent receives only
compressed summaries
(clean, focused context)
Real example — Anthropic multi-agent researcher: LeadResearcher (Opus) + SubAgents (Sonnet). Each sub-agent explores with fresh context. No cross-contamination. Lead agent only sees distilled results. Trade-off: uses ~15x more tokens total, but each individual context is clean and focused.
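The quarantine pattern reduces to "fresh context in, short summary out." Here is a minimal sketch; `run` stands in for an actual model call, and the summary cap is an assumption matching the 1-2K figure above.

```python
def run_subagent(task: str, run) -> str:
    """Each sub-agent starts from a fresh, isolated context and must
    return a short summary -- its raw work never reaches the lead agent."""
    context = [f"You are a sub-agent. Task: {task}"]  # clean context
    result = run(context)
    return result[:2000]  # cap the summary (~1-2K characters)

def lead_agent(tasks, run) -> list[str]:
    # The lead agent's context ends up holding only the plan
    # plus these compressed summaries.
    return [run_subagent(t, run) for t in tasks]

# Stub "model" for illustration: echoes a distilled finding per task.
fake_run = lambda ctx: f"summary of: {ctx[0].split('Task: ')[1]}"
print(lead_agent(["search docs", "analyze code"], fake_run))
```

Because each sub-agent's context is discarded after the summary is extracted, a hallucination in one branch cannot poison the others.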
Technique 4: Context Pruning (Remove the Junk)
Category: COMPRESS context Fixes: Distraction, Confusion
Surgically remove low-value tokens from context:
BEFORE PRUNING: AFTER PRUNING:
┌─────────────────────┐ ┌─────────────────────┐
│ System prompt │ │ System prompt │
│ Tool definitions │ │ Tool definitions │
│ Old tool result #1 │ │ Current goal │
│ Old tool result #2 │ │ Relevant finding │
│ Stale search results │ │ │
│ Old error traces │ │ (70% smaller!) │
│ Duplicate info │ └─────────────────────┘
│ Current goal │
│ Relevant finding │
└─────────────────────┘
Real example — Provence pruner: A reranker model that takes a full Wikipedia article (thousands of tokens) and outputs only the relevant paragraphs — removing 95% while perfectly answering the question.
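A simple pruning pass over a chat-style message list might look like this. The stub text and `keep_last` policy are assumptions; a smarter pruner (like Provence) would score relevance rather than just age.

```python
def prune_tool_results(messages: list[dict], keep_last: int = 2) -> list[dict]:
    """Replace all but the most recent tool results with a stub.
    System prompt, user turns, and assistant turns are untouched."""
    tool_idxs = [i for i, m in enumerate(messages) if m["role"] == "tool"]
    stale = set(tool_idxs[:-keep_last]) if keep_last else set(tool_idxs)
    return [
        {**m, "content": "[pruned tool output]"} if i in stale else m
        for i, m in enumerate(messages)
    ]

history = [
    {"role": "system", "content": "You are an agent."},
    {"role": "tool", "content": "huge old search dump..."},
    {"role": "tool", "content": "stale error trace..."},
    {"role": "tool", "content": "relevant finding"},
]
pruned = prune_tool_results(history, keep_last=1)
print(pruned[1]["content"])  # [pruned tool output]
```

Note the stubs stay in place rather than being deleted outright, so the message structure (and any KV-cache prefix before the edit point) remains intact.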
Technique 5: Context Summarization (Compress, Don’t Delete)
Category: COMPRESS context Fixes: Distraction, Window overflow
Use an LLM to distill long context into essential summaries:
┌─────────────────┐ ┌──────────────┐ ┌─────────────────┐
│ 95K tokens of │────▶│ Summarizer │────▶│ 5K summary │
│ conversation │ │ (LLM call) │ │ + last 5 files │
│ history │ │ │ │ + recent errors │
│ │ │ Keeps: │ │ │
│ (approaching │ │ • Decisions │ │ (fresh context │
│ window limit) │ │ • Open bugs │ │ agent continues│
│ │ │ • Architecture│ │ with clarity) │
│ │ │ Drops: │ │ │
│ │ │ • Old outputs│ │ │
│ │ │ • Redundancy │ │ │
└─────────────────┘ └──────────────┘ └─────────────────┘
Real examples:
- Claude Code “auto-compact” — Triggers at 95% of context window. Summarizes the full trajectory. Preserves architecture decisions and open bugs. Drops redundant tool outputs and old messages.
- Cognition (Devin) — Uses a fine-tuned model for summarization at agent-to-agent handoff boundaries.
- Anthropic “tool result eviction” — Once a tool call is deep in history, remove its raw output entirely. Safest, lightest form of compression.
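A compaction trigger in the style of the examples above can be sketched like this. The 95% threshold matches Claude Code's reported behavior; everything else (keeping the system prompt plus the last five turns verbatim) is an assumed policy for illustration, and `summarize` stands in for an LLM call.

```python
def maybe_compact(messages, token_count, window=200_000, summarize=None):
    """Auto-compact sketch: past 95% of the window, replace the middle
    of the history with a model-written summary, keeping the system
    prompt and the most recent turns verbatim."""
    if token_count < 0.95 * window:
        return messages  # plenty of headroom, do nothing
    head, tail = messages[:1], messages[-5:]   # system prompt + recent turns
    summary = summarize(messages[1:-5])        # LLM call in practice
    return head + [{"role": "assistant", "content": summary}] + tail

msgs = [{"role": "user", "content": f"turn {i}"} for i in range(40)]
compacted = maybe_compact(msgs, token_count=195_000,
                          summarize=lambda ms: "decisions + open bugs")
print(len(compacted))  # 7
```

The key design choice is what the summarizer is told to keep: decisions, open bugs, and architecture notes survive; redundant tool outputs do not.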
Technique 6: Context Offloading (External Memory)
Category: WRITE context Fixes: Poisoning, Distraction, Window overflow
Use files, scratchpads, or databases as external memory. The agent writes notes out, reads them back in when needed:
┌───────────────────────────────────────────────┐
│ AGENT CONTEXT WINDOW │
│ │
│ "Read todo.md to check progress..." │
│ "Write findings to notes.md..." │
│ │
│ (Window stays lean) │
└──────────────┬───────────────┬────────────────┘
│ write │ read
┌──────────────────────────────────────────────┐
│ EXTERNAL FILE SYSTEM │
│ │
│ todo.md ──── task progress, checkboxes │
│ notes.md ─── research findings │
│ plan.md ──── architecture decisions │
│ errors.log ─ past mistakes to avoid │
│ │
│ (Unlimited size, persistent, searchable) │
└──────────────────────────────────────────────┘
Real examples:
- Manus — Agent creates todo.md and updates it throughout the task. By re-reading the todo, the plan appears at the END of context (the recent attention window), preventing the “lost in the middle” problem.
- Claude playing Pokémon — Maintained notes across thousands of game steps. Tracked progress (“1234 steps on Route 1, Pikachu level 8, target level 10”), built maps of explored areas. After a context reset, it read its own notes and continued.
- Anthropic “think” tool — Agent writes reasoning to a scratchpad tool. Doesn’t clutter the main context. Up to 54% improvement on specialized benchmarks.
- Claude Code — CLAUDE.md + TASK.md + PLANNING.md as persistent files that survive context resets. Agent reads them at the start of every new conversation.
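The file-as-memory pattern needs almost no machinery. A minimal sketch (the `Scratchpad` class and file names are invented for illustration):

```python
import tempfile
from pathlib import Path

class Scratchpad:
    """External memory: the agent writes notes out and re-reads them,
    so the plan re-enters context at the *end* (the recency window)."""
    def __init__(self, root: Path):
        self.root = root

    def write(self, name: str, text: str) -> None:
        (self.root / name).write_text(text)

    def read(self, name: str) -> str:
        return (self.root / name).read_text()

pad = Scratchpad(Path(tempfile.mkdtemp()))
pad.write("todo.md", "- [x] schema\n- [ ] resolvers\n- [ ] tests\n")
# Re-reading appends the plan to the tail of the context each step.
print(pad.read("todo.md").splitlines()[1])  # - [ ] resolvers
```

The file system gives the agent unlimited, persistent, searchable storage; the context window only ever holds the pointer ("read todo.md") and the most recent read.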
Production Tricks
KV-Cache Optimization (10x Cost Savings)
Cached tokens cost $0.30/MTok vs uncached at $3.00/MTok. Manus reports an average input:output ratio of 100:1 — almost all cost is in the input (prefill). Cache hits = massive savings.
Rules for maximizing cache hits:
- Use deterministic JSON serialization (stable key order)
- Set explicit cache breakpoints after the system prompt
- Don’t put timestamps at the top of context (invalidates cache)
- Don’t modify or reorder previous messages
- Don’t add or remove tools mid-conversation
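The first rule is the easiest to get wrong silently. Python's `json.dumps` does not sort keys by default, so two logically identical payloads can serialize differently and break the cache prefix. A small sketch of deterministic serialization:

```python
import json

def stable_dumps(obj) -> str:
    """Deterministic serialization: sorted keys, fixed separators.
    Identical inputs always produce byte-identical output, so the
    provider's KV-cache keeps hitting on the shared prefix."""
    return json.dumps(obj, sort_keys=True, separators=(",", ":"))

a = stable_dumps({"b": 1, "a": 2})
b = stable_dumps({"a": 2, "b": 1})
print(a == b)  # True
```

Any tool definition, schema, or system-prompt fragment that gets serialized into the context should go through a function like this.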
Keep Errors in Context
Don’t hide mistakes — they’re valuable signal:
WRONG: Agent makes error → retry silently → same error again
RIGHT: Agent makes error → error stays in context →
model sees the failure → shifts strategy → recovers
Error recovery is one of the clearest indicators of a well-designed agent. The model needs to see what went wrong to avoid repeating it.
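In code, "keep errors in context" just means the failure gets appended as a message instead of being swallowed by a retry loop. A minimal sketch (the `flaky_read` tool is a stand-in):

```python
def call_tool(messages, tool, args):
    """On failure, record the error *in context* instead of retrying
    silently -- the model must see the failure to change strategy."""
    try:
        result = tool(**args)
        messages.append({"role": "tool", "content": str(result)})
    except Exception as e:
        messages.append({"role": "tool", "content": f"ERROR: {e!r}"})
    return messages

def flaky_read(path):  # stand-in tool that fails on a bad path
    raise FileNotFoundError(path)

msgs = call_tool([], flaky_read, {"path": "missing.txt"})
print(msgs[0]["content"])  # ERROR: FileNotFoundError('missing.txt')
```

On the next turn the model reads that error line, which is exactly the signal it needs to pick a different path instead of repeating the same call.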
Break Repetition Patterns
If context fills with similar action-observation pairs, the model starts auto-completing the pattern instead of reasoning. Fix by injecting variation: different serialization templates, alternate phrasing of observations, minor randomness in formatting.
As the Manus team puts it: “The more uniform your context, the more brittle your agent becomes.”
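One cheap way to inject that variation is to rotate among equivalent serialization templates for the same observation. The templates below are invented for illustration:

```python
import random

# A few equivalent templates for the same observation; rotating them
# keeps the context from becoming a uniform pattern the model
# auto-completes instead of reasoning about.
TEMPLATES = [
    "Observation: {tool} returned {result}",
    "[{tool}] -> {result}",
    "Result of {tool}: {result}",
]

def render(tool: str, result: str, rng: random.Random) -> str:
    return rng.choice(TEMPLATES).format(tool=tool, result=result)

rng = random.Random(0)
for _ in range(3):
    print(render("grep", "3 matches", rng))
```

The information content is identical each time; only the surface form varies, which is enough to break the few-shot mimicry loop.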
Restorable Compression
When compressing context, keep enough metadata to re-fetch if needed:
- Web page content → drop body, keep URL
- Document content → drop text, keep file path
- API response → drop payload, keep endpoint + params
Context shrinks without permanent information loss. The agent can always re-read or re-fetch later.
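The three bullet rules above reduce to one transformation: keep the handle, drop the payload. A sketch, with invented message shapes:

```python
def compress_restorable(msg: dict) -> dict:
    """Drop the heavy payload but keep enough metadata (URL, path,
    endpoint + params) to re-fetch the content later if needed."""
    kind = msg.get("kind")
    if kind == "web_page":
        return {"kind": "web_page", "url": msg["url"],
                "note": "body dropped; re-fetch via url"}
    if kind == "file":
        return {"kind": "file", "path": msg["path"],
                "note": "text dropped; re-read via path"}
    return msg  # unknown kinds pass through untouched

page = {"kind": "web_page", "url": "https://example.com/docs",
        "body": "x" * 50_000}
print(compress_restorable(page)["url"])  # https://example.com/docs
```

Unlike plain pruning, nothing is lost for good: if the agent later decides it needs the page after all, the URL is still sitting in context.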
The Decision Matrix
Which technique fixes which failure mode:
| Technique | Poisoning | Distraction | Confusion | Clash |
|---|---|---|---|---|
| RAG | | Yes | Yes | |
| Tool Loadout | | | Yes | |
| Quarantine | Yes | Yes | | |
| Pruning | | Yes | Yes | |
| Summarization | | Yes | | |
| Offloading | Yes | Yes | | |
For long-running tasks, the choice depends on what you’re doing:
| Strategy | Best For |
|---|---|
| Summarization | Long single-thread tasks, coding, writing |
| Multi-Agent | Parallel exploration, research, breadth |
| Hybrid | Most real-world agents (Claude Code, Manus) |
Putting It All Together
Here’s how a real agent handles a complex task like “Migrate auth module from REST to GraphQL”:
STEP 1: Load minimal context
System prompt (stable, cached)
CLAUDE.md rules (always loaded)
Selected tools: read, write, grep, test ◄── Tool Loadout
STEP 2: Agent explores (progressive disclosure)
grep "auth" src/ → finds 12 files ◄── RAG (on-demand)
read src/auth/routes.py → 200 lines
read src/auth/models.py → 150 lines
STEP 3: Agent works + writes notes
Creates schema.graphql
Writes to todo.md: "✓ schema done, ◄── Offloading
☐ resolvers, ☐ tests"
STEP 4: Context getting large (approaching limit)
Auto-compact triggers ◄── Summarization
Old tool outputs evicted ◄── Pruning
Summary: "auth_login resolver has edge case bug"
STEP 5: Sub-agent for parallel work
Spawns test-writing sub-agent ◄── Quarantine
Sub-agent gets: schema + models only
Returns: "14 tests, 2 failures in auth_login"
Main agent gets 1K summary, not 50K of work
STEP 6: Agent reads todo.md at END of context
Rewrites todo.md → pushes plan into ◄── Offloading
recency window: "☐ Fix auth_login edge case
☐ Remaining 4 resolvers
☐ Integration tests"
Agent stays on track despite 200+ tool calls
The Golden Rule
Find the MINIMUM set of HIGH-SIGNAL tokens that MAXIMIZES the probability of your desired outcome. Every token must EARN its place in the context window.
Production agents don’t use one technique — they use all six as complementary strategies applied throughout the agent loop. RAG for selective retrieval, tool loadouts for focused capabilities, quarantine for isolation, pruning and summarization for compression, offloading for persistence. The goal is always the same: minimum tokens, maximum signal, at every step.