Context Engineering for AI Agents: 6 Techniques from Claude Code, Manus, and Devin
9th March 2026
After studying how production AI agents like Claude Code, Manus, and Devin actually work under the hood, the single most important concept isn’t prompt engineering — it’s context engineering. The art of controlling exactly what goes into the model’s context window, and what stays out.
This post breaks down the problem, the failure modes, and the six techniques that production agents use to manage context effectively.
An Agent = Context + Model + Prompt + Tools
Everything an LLM can reason about lives in its context window. Think of it as the model’s working memory:
┌─── CONTEXT WINDOW (the LLM's RAM) ────────────────────────────────────┐
│ │
│ ┌─────────────┐ System prompt, rules, persona │
│ │ PROMPT │ "You are an agent that..." │
│ └─────────────┘ │
│ ┌─────────────┐ Function definitions, MCP servers │
│ │ TOOLS │ search(), read_file(), run_tests() │
│ └─────────────┘ │
│ ┌─────────────┐ RAG results, file contents, API data │
│ │ KNOWLEDGE │ Retrieved docs, embeddings, facts │
│ └─────────────┘ │
│ ┌─────────────┐ Prior turns, tool results, errors │
│ │ MSG HISTORY │ user → assistant → tool → assistant → ... │
│ └─────────────┘ │
│ ┌─────────────┐ Few-shot examples, memories │
│ │ EXAMPLES │ Demonstrations of desired behavior │
│ └─────────────┘ │
│ │
│ ⚠ ALL OF THIS COMPETES FOR LIMITED ATTENTION │
└───────────────────────────────────────────────────────────────────────┘
│
▼
┌───────────────┐
│ LLM Model │ Attends over everything above
└───────┬───────┘
│
▼
Next action or response
The core problem: every token in the window competes for the model’s attention. With n tokens there are n² pairwise attention relationships to maintain. More tokens do not mean better results — often they mean worse ones.
- A Gemini agent degraded after ~100K tokens
- Llama 3.1 405B degrades after ~32K tokens
- Smaller models fail much earlier
- Context has diminishing returns, just like human working memory
The Four Context Failure Modes
When agents fail, it’s almost always one of these four problems:
1. Context Poisoning
Agent writes incorrect info into its own context, then treats it as ground truth — chasing impossible goals.
Example: Gemini playing Pokémon hallucinated a game state, wrote it to its “goals” section, then spent hours chasing an impossible objective.
2. Context Distraction
Context so long the model over-focuses on history, ignoring its training. Starts repeating past actions instead of planning new ones.
Example: Gemini agent past 100K tokens just repeated previous actions from history instead of synthesizing new strategies.
3. Context Confusion
Irrelevant information (extra tools, docs) causes the model to call wrong tools or use wrong data.
Example: Llama 8B given 46 tools = FAILS. Llama 8B given 19 tools = SUCCEEDS. Same task, same model. Less is more.
4. Context Clash
Early wrong answers stay in context and conflict with later correct information.
Example: o3 scored 98.1% when given all the information in a single prompt but only 64.1% when the same information was spread across multiple turns, because early wrong guesses polluted the context. (Across the models in the underlying study, the average drop was 39%.)
The 6 Techniques to Fix Context
Technique 1: RAG (Retrieval-Augmented Generation)
Category: SELECT context Fixes: Confusion, Distraction
Instead of stuffing everything in, retrieve only what’s relevant:
WITHOUT RAG: WITH RAG:
┌──────────────────┐ ┌──────────────────┐
│ 500 pages of docs│ │ Query: "auth bug" │
│ all jammed in │ │ │ │
│ context window │ │ ▼ │
│ (confused model) │ │ ┌──────────────┐ │
└──────────────────┘ │ │ Vector search│ │
│ └──────┬───────┘ │
│ ▼ │
│ Top 3 relevant │
│ chunks only │
│ (focused model) │
└──────────────────┘
Real example — Claude Code uses a hybrid model:
- CLAUDE.md — always pre-loaded (procedural memory)
- glob/grep — agent searches the codebase on demand
- file read — loads specific files when needed
Instead of indexing the entire codebase into context, it navigates like a human: browse → find → read.
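The select-then-load idea can be sketched in a few lines. This is a minimal, dependency-free illustration that scores documents by word overlap with the query; a production system would use embeddings and a vector index instead, and the `docs` corpus here is invented for the example.

```python
def retrieve(query: str, docs: dict[str, str], k: int = 3) -> list[str]:
    """Score each doc by word overlap with the query; return top-k names.

    Word overlap stands in for vector search so the sketch stays
    dependency-free -- swap in embeddings + an index for real use.
    """
    q = set(query.lower().split())
    scored = sorted(
        docs,
        key=lambda name: len(q & set(docs[name].lower().split())),
        reverse=True,
    )
    return scored[:k]

# Hypothetical corpus for illustration
docs = {
    "auth.md": "token refresh bug in auth login flow",
    "billing.md": "invoice generation and payment retries",
    "deploy.md": "kubernetes deploy pipeline",
}
print(retrieve("auth bug", docs, k=1))  # ['auth.md']
```

Only the winning chunks enter the context window; the other 500 pages stay on disk.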
Technique 2: Tool Loadout (Dynamic Tool Selection)
Category: SELECT context Fixes: Confusion
Don’t give all tools at once. Select the right “loadout” for the current task:
ALL TOOLS IN CONTEXT TOOL LOADOUT
┌──────────────────────────┐ ┌──────────────────────────┐
│ search_web() │ │ │
│ read_file() │ │ User query: "find bugs │
│ write_file() │ │ in auth module" │
│ send_email() │ │ ▼ │
│ create_calendar() │ │ Tool RAG / Recommender │
│ translate_text() │ │ │ │
│ resize_image() │ │ ▼ │
│ parse_pdf() │ │ Selected tools: │
│ ... 40 more tools ... │ │ • read_file() │
│ │ │ • grep_search() │
│ Model: "uhh which one?" │ │ • run_tests() │
└──────────────────────────┘ │ │
│ Model: "got it, clear!" │
└──────────────────────────┘
Real example — Manus masks tools instead of removing them. Tool definitions stay in context (preserving KV-cache) but logits are masked so the agent can’t select irrelevant tools. During a browser phase, browser_click() is allowed but shell_run() is masked.
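A rough sketch of the loadout idea, assuming a phase-based allow-list (the phase names and `PHASE_LOADOUT` mapping are invented for illustration). Manus does this at the logit level; filtering the selectable set per phase while keeping every definition in context mimics that effect at the application level.

```python
# All tool definitions stay in context (preserving the KV-cache);
# only the *allowed* set changes per phase, mimicking logit masking.
ALL_TOOLS = ["browser_click", "browser_read", "shell_run", "read_file"]

PHASE_LOADOUT = {  # hypothetical phase -> allowed-tool mapping
    "browsing": {"browser_click", "browser_read"},
    "coding": {"shell_run", "read_file"},
}

def allowed_tools(phase: str) -> list[str]:
    allow = PHASE_LOADOUT[phase]
    # Iterate ALL_TOOLS so the order is stable and the serialized
    # context doesn't churn between steps.
    return [t for t in ALL_TOOLS if t in allow]

print(allowed_tools("browsing"))  # ['browser_click', 'browser_read']
```

During a browsing phase the shell tools simply cannot be selected, which is the point: the model never has to reason about irrelevant options.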
Technique 3: Context Quarantine (Multi-Agent Isolation)
Category: ISOLATE context Fixes: Poisoning, Distraction
Split work across sub-agents, each with its own clean context:
┌──────────────────────────────────┐
│ LEAD AGENT │
│ (high-level plan + synthesis) │
│ Context: plan + summaries only │
└──────────┬───────────────────────┘
│ │ │
┌───────▼──┐ ┌────▼─────┐ ┌▼──────────┐
│ SubAgent │ │ SubAgent │ │ SubAgent │
│ "Search" │ │ "Analyze"│ │ "Code" │
│ │ │ │ │ │
│ Own tools│ │ Own tools│ │ Own tools │
│ Own ctx │ │ Own ctx │ │ Own ctx │
└────┬─────┘ └────┬─────┘ └─────┬─────┘
│ │ │
▼ ▼ ▼
1-2K token 1-2K token 1-2K token
summary summary summary
│ │ │
└─────────────┼─────────────┘
▼
Lead agent receives only
compressed summaries
(clean, focused context)
Real example — Anthropic multi-agent researcher: LeadResearcher (Opus) + SubAgents (Sonnet). Each sub-agent explores with fresh context. No cross-contamination. Lead agent only sees distilled results. Trade-off: uses ~15x more tokens total, but each individual context is clean and focused.
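The quarantine pattern reduces to "fresh context in, short summary out." Here is a minimal sketch; `run` stands in for an actual model call, and the summary cap is an assumption matching the 1-2K figure above.

```python
def run_subagent(task: str, run) -> str:
    """Each sub-agent starts from a fresh, isolated context and must
    return a short summary -- its raw work never reaches the lead agent."""
    context = [f"You are a sub-agent. Task: {task}"]  # clean context
    result = run(context)
    return result[:2000]  # cap the summary (~1-2K characters)

def lead_agent(tasks, run) -> list[str]:
    # The lead agent's context ends up holding only the plan
    # plus these compressed summaries.
    return [run_subagent(t, run) for t in tasks]

# Stub "model" for illustration: echoes a distilled finding per task.
fake_run = lambda ctx: f"summary of: {ctx[0].split('Task: ')[1]}"
print(lead_agent(["search docs", "analyze code"], fake_run))
```

Because each sub-agent's context is discarded after the summary is extracted, a hallucination in one branch cannot poison the others.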
Technique 4: Context Pruning (Remove the Junk)
Category: COMPRESS context Fixes: Distraction, Confusion
Surgically remove low-value tokens from context:
BEFORE PRUNING: AFTER PRUNING:
┌─────────────────────┐ ┌─────────────────────┐
│ System prompt │ │ System prompt │
│ Tool definitions │ │ Tool definitions │
│ Old tool result #1 │ │ Current goal │
│ Old tool result #2 │ │ Relevant finding │
│ Stale search results │ │ │
│ Old error traces │ │ (70% smaller!) │
│ Duplicate info │ └─────────────────────┘
│ Current goal │
│ Relevant finding │
└─────────────────────┘
Real example — Provence pruner: A reranker model that takes a full Wikipedia article (thousands of tokens) and outputs only the relevant paragraphs — removing 95% while perfectly answering the question.
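A simple pruning pass over a chat-style message list might look like this. The stub text and `keep_last` policy are assumptions; a smarter pruner (like Provence) would score relevance rather than just age.

```python
def prune_tool_results(messages: list[dict], keep_last: int = 2) -> list[dict]:
    """Replace all but the most recent tool results with a stub.
    System prompt, user turns, and assistant turns are untouched."""
    tool_idxs = [i for i, m in enumerate(messages) if m["role"] == "tool"]
    stale = set(tool_idxs[:-keep_last]) if keep_last else set(tool_idxs)
    return [
        {**m, "content": "[pruned tool output]"} if i in stale else m
        for i, m in enumerate(messages)
    ]

history = [
    {"role": "system", "content": "You are an agent."},
    {"role": "tool", "content": "huge old search dump..."},
    {"role": "tool", "content": "stale error trace..."},
    {"role": "tool", "content": "relevant finding"},
]
pruned = prune_tool_results(history, keep_last=1)
print(pruned[1]["content"])  # [pruned tool output]
```

Note the stubs stay in place rather than being deleted outright, so the message structure (and any KV-cache prefix before the edit point) remains intact.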
Technique 5: Context Summarization (Compress, Don’t Delete)
Category: COMPRESS context Fixes: Distraction, Window overflow
Use an LLM to distill long context into essential summaries:
┌─────────────────┐ ┌──────────────┐ ┌─────────────────┐
│ 95K tokens of │────▶│ Summarizer │────▶│ 5K summary │
│ conversation │ │ (LLM call) │ │ + last 5 files │
│ history │ │ │ │ + recent errors │
│ │ │ Keeps: │ │ │
│ (approaching │ │ • Decisions │ │ (fresh context │
│ window limit) │ │ • Open bugs │ │ agent continues│
│ │ │ • Architecture│ │ with clarity) │
│ │ │ Drops: │ │ │
│ │ │ • Old outputs│ │ │
│ │ │ • Redundancy │ │ │
└─────────────────┘ └──────────────┘ └─────────────────┘
Real examples:
- Claude Code “auto-compact” — Triggers at 95% of context window. Summarizes the full trajectory. Preserves architecture decisions and open bugs. Drops redundant tool outputs and old messages.
- Cognition (Devin) — Uses a fine-tuned model for summarization at agent-to-agent handoff boundaries.
- Anthropic “tool result eviction” — Once a tool call is deep in history, remove its raw output entirely. Safest, lightest form of compression.
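A compaction trigger in the style of the examples above can be sketched like this. The 95% threshold matches Claude Code's reported behavior; everything else (keeping the system prompt plus the last five turns verbatim) is an assumed policy for illustration, and `summarize` stands in for an LLM call.

```python
def maybe_compact(messages, token_count, window=200_000, summarize=None):
    """Auto-compact sketch: past 95% of the window, replace the middle
    of the history with a model-written summary, keeping the system
    prompt and the most recent turns verbatim."""
    if token_count < 0.95 * window:
        return messages  # plenty of headroom, do nothing
    head, tail = messages[:1], messages[-5:]   # system prompt + recent turns
    summary = summarize(messages[1:-5])        # LLM call in practice
    return head + [{"role": "assistant", "content": summary}] + tail

msgs = [{"role": "user", "content": f"turn {i}"} for i in range(40)]
compacted = maybe_compact(msgs, token_count=195_000,
                          summarize=lambda ms: "decisions + open bugs")
print(len(compacted))  # 7
```

The key design choice is what the summarizer is told to keep: decisions, open bugs, and architecture notes survive; redundant tool outputs do not.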
Technique 6: Context Offloading (External Memory)
Category: WRITE context Fixes: Poisoning, Distraction, Window overflow
Use files, scratchpads, or databases as external memory. The agent writes notes out, reads them back in when needed:
┌───────────────────────────────────────────────┐
│ AGENT CONTEXT WINDOW │
│ │
│ "Read todo.md to check progress..." │
│ "Write findings to notes.md..." │
│ │
│ (Window stays lean) │
└──────────────┬───────────────┬────────────────┘
│ write │ read
┌──────────────────────────────────────────────┐
│ EXTERNAL FILE SYSTEM │
│ │
│ todo.md ──── task progress, checkboxes │
│ notes.md ─── research findings │
│ plan.md ──── architecture decisions │
│ errors.log ─ past mistakes to avoid │
│ │
│ (Unlimited size, persistent, searchable) │
└──────────────────────────────────────────────┘
Real examples:
- Manus — Agent creates todo.md and updates it throughout the task. By re-reading the todo, the plan appears at the END of context (the recent attention window), preventing the “lost in the middle” problem.
- Claude playing Pokémon — Maintained notes across thousands of game steps. Tracked progress (“1234 steps on Route 1, Pikachu level 8, target level 10”), built maps of explored areas. After a context reset, it read its own notes and continued.
- Anthropic “think” tool — Agent writes reasoning to a scratchpad tool. Doesn’t clutter the main context. Up to 54% improvement on specialized benchmarks.
- Claude Code — CLAUDE.md + TASK.md + PLANNING.md as persistent files that survive context resets. Agent reads them at the start of every new conversation.
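The file-as-memory pattern needs almost no machinery. A minimal sketch (the `Scratchpad` class and file names are invented for illustration):

```python
import tempfile
from pathlib import Path

class Scratchpad:
    """External memory: the agent writes notes out and re-reads them,
    so the plan re-enters context at the *end* (the recency window)."""
    def __init__(self, root: Path):
        self.root = root

    def write(self, name: str, text: str) -> None:
        (self.root / name).write_text(text)

    def read(self, name: str) -> str:
        return (self.root / name).read_text()

pad = Scratchpad(Path(tempfile.mkdtemp()))
pad.write("todo.md", "- [x] schema\n- [ ] resolvers\n- [ ] tests\n")
# Re-reading appends the plan to the tail of the context each step.
print(pad.read("todo.md").splitlines()[1])  # - [ ] resolvers
```

The file system gives the agent unlimited, persistent, searchable storage; the context window only ever holds the pointer ("read todo.md") and the most recent read.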
Production Tricks
KV-Cache Optimization (10x Cost Savings)
Cached tokens cost $0.30/MTok vs uncached at $3.00/MTok. Manus reports an average input:output ratio of 100:1 — almost all cost is in the input (prefill). Cache hits = massive savings.
Rules for maximizing cache hits:
- Use deterministic JSON serialization (stable key order)
- Set explicit cache breakpoints after the system prompt
- Don’t put timestamps at the top of context (invalidates cache)
- Don’t modify or reorder previous messages
- Don’t add or remove tools mid-conversation
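The first rule is the easiest to get wrong silently. Python's `json.dumps` does not sort keys by default, so two logically identical payloads can serialize differently and break the cache prefix. A small sketch of deterministic serialization:

```python
import json

def stable_dumps(obj) -> str:
    """Deterministic serialization: sorted keys, fixed separators.
    Identical inputs always produce byte-identical output, so the
    provider's KV-cache keeps hitting on the shared prefix."""
    return json.dumps(obj, sort_keys=True, separators=(",", ":"))

a = stable_dumps({"b": 1, "a": 2})
b = stable_dumps({"a": 2, "b": 1})
print(a == b)  # True
```

Any tool definition, schema, or system-prompt fragment that gets serialized into the context should go through a function like this.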
Keep Errors in Context
Don’t hide mistakes — they’re valuable signal:
WRONG: Agent makes error → retry silently → same error again
RIGHT: Agent makes error → error stays in context →
model sees the failure → shifts strategy → recovers
Error recovery is one of the clearest indicators of a well-designed agent. The model needs to see what went wrong to avoid repeating it.
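In code, "keep errors in context" just means the failure gets appended as a message instead of being swallowed by a retry loop. A minimal sketch (the `flaky_read` tool is a stand-in):

```python
def call_tool(messages, tool, args):
    """On failure, record the error *in context* instead of retrying
    silently -- the model must see the failure to change strategy."""
    try:
        result = tool(**args)
        messages.append({"role": "tool", "content": str(result)})
    except Exception as e:
        messages.append({"role": "tool", "content": f"ERROR: {e!r}"})
    return messages

def flaky_read(path):  # stand-in tool that fails on a bad path
    raise FileNotFoundError(path)

msgs = call_tool([], flaky_read, {"path": "missing.txt"})
print(msgs[0]["content"])  # ERROR: FileNotFoundError('missing.txt')
```

On the next turn the model reads that error line, which is exactly the signal it needs to pick a different path instead of repeating the same call.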
Break Repetition Patterns
If context fills with similar action-observation pairs, the model starts auto-completing the pattern instead of reasoning. Fix by injecting variation: different serialization templates, alternate phrasing of observations, minor randomness in formatting.
As the Manus team puts it: “The more uniform your context, the more brittle your agent becomes.”
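One cheap way to inject that variation is to rotate among equivalent serialization templates for the same observation. The templates below are invented for illustration:

```python
import random

# A few equivalent templates for the same observation; rotating them
# keeps the context from becoming a uniform pattern the model
# auto-completes instead of reasoning about.
TEMPLATES = [
    "Observation: {tool} returned {result}",
    "[{tool}] -> {result}",
    "Result of {tool}: {result}",
]

def render(tool: str, result: str, rng: random.Random) -> str:
    return rng.choice(TEMPLATES).format(tool=tool, result=result)

rng = random.Random(0)
for _ in range(3):
    print(render("grep", "3 matches", rng))
```

The information content is identical each time; only the surface form varies, which is enough to break the few-shot mimicry loop.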
Restorable Compression
When compressing context, keep enough metadata to re-fetch if needed:
- Web page content → drop body, keep URL
- Document content → drop text, keep file path
- API response → drop payload, keep endpoint + params
Context shrinks without permanent information loss. The agent can always re-read or re-fetch later.
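The three bullet rules above reduce to one transformation: keep the handle, drop the payload. A sketch, with invented message shapes:

```python
def compress_restorable(msg: dict) -> dict:
    """Drop the heavy payload but keep enough metadata (URL, path,
    endpoint + params) to re-fetch the content later if needed."""
    kind = msg.get("kind")
    if kind == "web_page":
        return {"kind": "web_page", "url": msg["url"],
                "note": "body dropped; re-fetch via url"}
    if kind == "file":
        return {"kind": "file", "path": msg["path"],
                "note": "text dropped; re-read via path"}
    return msg  # unknown kinds pass through untouched

page = {"kind": "web_page", "url": "https://example.com/docs",
        "body": "x" * 50_000}
print(compress_restorable(page)["url"])  # https://example.com/docs
```

Unlike plain pruning, nothing is lost for good: if the agent later decides it needs the page after all, the URL is still sitting in context.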
The Decision Matrix
Which technique fixes which failure mode:
| Technique | Poisoning | Distraction | Confusion | Clash |
|---|---|---|---|---|
| RAG | | Yes | Yes | |
| Tool Loadout | | | Yes | |
| Quarantine | Yes | Yes | | |
| Pruning | | Yes | Yes | |
| Summarization | | Yes | | |
| Offloading | Yes | Yes | | |
For long-running tasks, the choice depends on what you’re doing:
| Strategy | Best For |
|---|---|
| Summarization | Long single-thread tasks, coding, writing |
| Multi-Agent | Parallel exploration, research, breadth |
| Hybrid | Most real-world agents (Claude Code, Manus) |
Putting It All Together
Here’s how a real agent handles a complex task like “Migrate auth module from REST to GraphQL”:
STEP 1: Load minimal context
System prompt (stable, cached)
CLAUDE.md rules (always loaded)
Selected tools: read, write, grep, test ◄── Tool Loadout
STEP 2: Agent explores (progressive disclosure)
grep "auth" src/ → finds 12 files ◄── RAG (on-demand)
read src/auth/routes.py → 200 lines
read src/auth/models.py → 150 lines
STEP 3: Agent works + writes notes
Creates schema.graphql
Writes to todo.md: "✓ schema done, ◄── Offloading
☐ resolvers, ☐ tests"
STEP 4: Context getting large (approaching limit)
Auto-compact triggers ◄── Summarization
Old tool outputs evicted ◄── Pruning
Summary: "auth_login resolver has edge case bug"
STEP 5: Sub-agent for parallel work
Spawns test-writing sub-agent ◄── Quarantine
Sub-agent gets: schema + models only
Returns: "14 tests, 2 failures in auth_login"
Main agent gets 1K summary, not 50K of work
STEP 6: Agent reads todo.md at END of context
Rewrites todo.md → pushes plan into ◄── Offloading
recency window: "☐ Fix auth_login edge case
☐ Remaining 4 resolvers
☐ Integration tests"
Agent stays on track despite 200+ tool calls
The Golden Rule
Find the MINIMUM set of HIGH-SIGNAL tokens that MAXIMIZES the probability of your desired outcome. Every token must EARN its place in the context window.
Production agents don’t use one technique — they use all six as complementary strategies applied throughout the agent loop. RAG for selective retrieval, tool loadouts for focused capabilities, quarantine for isolation, pruning and summarization for compression, offloading for persistence. The goal is always the same: minimum tokens, maximum signal, at every step.