Claude Code’s Design Philosophy: 10 Patterns to Steal for Your Agent Systems
31st March 2026
A deep dive into Claude Code’s engineering decisions — the prompt architecture, tool philosophy, concurrency model, permission system, and memory design that make it work. Each section includes what you can apply to your own agent systems.
1. The Prompt Is The Product
Most agent builders treat prompts as an afterthought — write the tools and code first, then add a system prompt at the end. Claude Code inverts this: the prompt is the primary artifact, and everything else is built around it.
The system prompt is structured into independently iterable, A/B testable sections:
┌────────────────────────────────────────────────────┐
│ getSimpleIntroSection()      ← Identity            │
│ getSimpleSystemSection()     ← Mechanics           │
│ getSimpleDoingTasksSection() ← Philosophy          │
│ getActionsSection()          ← Ethics              │
│ getUsingYourToolsSection()   ← Judgment            │
│ getOutputEfficiencySection() ← Style               │
│ getToneAndStyleSection()     ← Voice               │
│                                                    │
│ ── DYNAMIC_BOUNDARY ──────── ← Cache break point   │
│                                                    │
│ getMemorySection()           ← Per-project context │
│ getEnvironmentSection()      ← Per-session state   │
└────────────────────────────────────────────────────┘
Everything above the boundary is static — same for all users, all sessions. It gets cached globally and the cache is shared across users. Everything below is dynamic per user or session and cannot be cached.
Two design details worth noting: @[MODEL LAUNCH] markers allow tuning per model generation without touching the rest of the prompt. Quantified anchors replace vague adjectives — “keep text between tool calls to ≤25 words” instead of “be concise.”
What to steal for your agent systems:
- Split your prompt into named sections — you can’t A/B test what you can’t isolate
- Put cacheable content first, dynamic content last
- Use numbers not adjectives (“max 25 words” not “be brief”)
- Version sections with model-generation tags so you can tune per model
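A minimal sketch of what this structure can look like in code. The section contents are invented placeholders; the `cache_control` marker matches the Anthropic Messages API convention for cache breakpoints, but treat the exact shape as an assumption:

```python
# Hypothetical sketch of a sectioned system prompt with a cache boundary.
# Section names mirror the diagram above; the text inside is invented.

STATIC_SECTIONS = [
    lambda: "You are a coding agent.",                       # identity
    lambda: "Keep text between tool calls to <=25 words.",   # quantified anchor
]

def build_prompt(memory: str, environment: str) -> list[dict]:
    """Return prompt blocks: static first (cacheable), dynamic last."""
    static = "\n\n".join(section() for section in STATIC_SECTIONS)
    return [
        # Everything up to and including this block can be cached and
        # shared, because it is byte-identical for every user and session.
        {"text": static, "cache_control": {"type": "ephemeral"}},
        {"text": memory},        # per-project context, not cached
        {"text": environment},   # per-session state, not cached
    ]
```

Because each section is a named callable, a single section can be swapped out for an A/B test without touching the rest of the prompt.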
2. Meta-Prompting — Teaching Judgment, Not Just API
A standard tool description tells the model what a tool does. Claude Code’s tool descriptions do three things:
- WHAT it does — one line
- WHEN to use it and when NOT to — decision logic with named alternatives
- HOW to use it well — anti-patterns, safety rails, concrete examples
Plus a WHY — the reason behind the rule, so the model can generalize to novel situations.
For example, the Bash tool description doesn’t just say “runs shell commands.” It says: use Grep instead of running rg via Bash because the user gets a better review experience with dedicated tools. The model now knows the principle, not just the rule. It can apply that principle to tools and situations the prompt never explicitly covered.
This is why Claude Code picks the right tool at a high rate. Most agents pick tools by keyword matching because their tool descriptions only answer “what” — not “when” or “why.”
What to steal:
- Add a “WHEN NOT TO USE” section to every tool description
- Add “PREFER X OVER Y” routing rules for overlapping tools
- Include the WHY so the model can generalize to new situations
- Put decision logic before parameter documentation
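Here is what such a description can look like as a concrete tool definition. The wording is invented, but the WHAT / WHEN NOT / HOW / WHY structure and the Grep-over-Bash routing rule follow the section above:

```python
# Hypothetical tool definition following the WHAT / WHEN NOT / HOW / WHY
# structure. Decision logic comes before parameter documentation.
GREP_TOOL = {
    "name": "Grep",
    "description": (
        "WHAT: Searches file contents with regular expressions.\n"
        "WHEN NOT TO USE: Do not run `rg` or `grep` through Bash; "
        "PREFER Grep OVER Bash for content search.\n"
        "HOW: Scope searches with a glob filter; avoid unanchored .* patterns.\n"
        "WHY: Dedicated tools give the user a structured, reviewable result "
        "that raw shell output does not."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "pattern": {"type": "string"},
            "glob": {"type": "string"},
        },
        "required": ["pattern"],
    },
}
```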
3. Generator-Based Streaming Architecture
Most agents wait for the model to finish streaming, then execute tools, then send results back. Claude Code starts executing tools while the model is still streaming.
Standard approach:
request → wait → response → execute tools → send results
Claude Code approach:
request → stream → parse tool_use block #1 → START executing tool #1
→ parse tool_use block #2 → START executing tool #2
(parallel if read-only)
→ model finishes streaming
→ tool #1 already done
→ tool #2 finishing...
Tools are categorized by concurrency safety. Read-only tools (Glob, Grep, Read) run in parallel, up to 10 at once. Write tools run sequentially to avoid race conditions. If a Bash tool fails, sibling tools are aborted.
The practical impact: read-heavy turns (exploring a codebase, reading multiple files) finish significantly faster because file reads that would have been sequential now run in parallel during the same streaming window.
What to steal:
- Parse tool calls from streaming chunks — don’t wait for the full response
- Categorize tools as read-only vs write before execution
- Run read-only tools in parallel (the latency win is significant)
- Run write tools sequentially (avoids race conditions)
- Abort sibling tools on critical failure
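The concurrency rules above can be sketched with `asyncio`: a semaphore caps parallel read-only tools at 10, and a lock serializes writes. The tool names and the stubbed `run_tool` are placeholders; real implementations would also wire in the abort-on-failure path:

```python
import asyncio

READ_ONLY = {"Read", "Glob", "Grep"}   # assumed read-only categorization

async def run_tool(name: str, args: dict) -> str:
    await asyncio.sleep(0.01)          # stand-in for real tool work
    return f"{name} done"

async def execute_streamed(tool_calls):
    """Start each tool as soon as it is parsed from the stream.

    `tool_calls` is an async iterator yielding (name, args) pairs as
    tool_use blocks are parsed from streaming chunks.
    """
    read_slots = asyncio.Semaphore(10)   # up to 10 parallel reads
    write_lock = asyncio.Lock()          # writes run one at a time

    async def guarded(name, args):
        if name in READ_ONLY:
            async with read_slots:
                return await run_tool(name, args)
        async with write_lock:
            return await run_tool(name, args)

    tasks = []
    async for name, args in tool_calls:
        # Execution begins here, while the model is still streaming.
        tasks.append(asyncio.create_task(guarded(name, args)))
    return await asyncio.gather(*tasks)
```

The key point is that `create_task` fires inside the streaming loop, so by the time the model finishes its turn, read-only tools are typically already done.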
4. Five-Layer Permission System
Claude Code uses five independent layers to decide whether a tool call can proceed. Any one layer can block the operation. No layer trusts any other.
| Layer | Scope | What it does |
|---|---|---|
| Input Validation | Per-tool, static | Schema check, path traversal prevention |
| Mode Policy | Session-scoped | Plan mode blocks all writes; auto mode defers to classifier |
| Rule Matching | Persistent whitelist | User-configured patterns like Bash(npm run:*) |
| Hook Evaluation | Extensible, async | PreToolUse hooks with custom logic; can modify inputs |
| Human Review | Multi-channel racing | Terminal UI, IDE bridge, mobile app, classifier — first responder wins |
The racing pattern at Layer 5 is particularly interesting: six sources race concurrently for permission — terminal UI, IDE bridge, mobile channel, hooks, classifier, and a coordinator. The first to claim the decision wins atomically. This means a developer can approve from their phone while the terminal is waiting, and it works correctly without any race condition.
Critically, safety rules are enforced at two levels simultaneously. The prompt says “never force push to main.” The permission system independently blocks git push --force on protected branches. The model cannot override the mechanical check by reasoning its way around the prompt instruction.
What to steal:
- Validate tool inputs mechanically — don’t rely on the model to self-police
- Categorize tools by risk: read / write / destructive
- Auto-approve reads, prompt for writes, hard-block dangerous operations
- Make permission rules persistent and user-configurable
- Keep “what the model wants” separate from “what the system allows”
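A toy version of the layered check, covering the first three layers of the table. Every check here is invented for illustration; the point is the shape — independent predicates, all of which must pass, none of which trusts another:

```python
# Sketch of independent permission layers; any single layer may block.
# Layer names follow the table above; the checks themselves are invented.

def input_validation(call: dict) -> bool:
    """Layer 1: mechanical schema/path checks, no model involved."""
    return ".." not in call.get("path", "")        # path traversal guard

def mode_policy(call: dict, mode: str) -> bool:
    """Layer 2: session mode policy (plan mode blocks all writes)."""
    return not (mode == "plan" and call["kind"] == "write")

def rule_matching(call: dict, allowed: tuple) -> bool:
    """Layer 3: persistent user-configured whitelist patterns."""
    return any(call["name"].startswith(p) for p in allowed)

def check_permission(call: dict, mode: str = "auto",
                     allowed: tuple = ("Read", "Grep")) -> bool:
    layers = [
        input_validation(call),
        mode_policy(call, mode),
        rule_matching(call, allowed),
    ]
    return all(layers)   # every layer must independently approve
```

In the real system, calls that fail the whitelist fall through to hooks and human review rather than being rejected outright; this sketch collapses that into a single boolean for brevity.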
5. Prompt Cache Economics
The cost math is stark. Without caching, a 50-turn session with a 20K-token system prompt wastes roughly 1 million input tokens. With proper caching structure, turns 2–50 hit the cache at a 90% discount.
Claude Code maximizes cache hits by obsessively controlling what changes between turns. The static section of the system prompt — identity, philosophy, tool descriptions, code quality rules — is identical for all users in all sessions. It gets cached at global scope, meaning the cache is shared across users, not just per-session.
Cache busting sources they track and avoid:
- New MCP tools connected
- GrowthBook feature flags refreshed
- Auto mode toggled
- Permission rules changed
Tool schemas are memoized per-session and survive GrowthBook refreshes. Forked agents share the parent’s prompt cache via byte-identical prefixes. The compact agent uses the same tracking key as the main thread. Microcompact sends “cache edits” instead of deleting messages — edits don’t break the cache, deletions do.
What to steal:
- Put all static content before all dynamic content in your system prompt
- Never mutate the static section between turns — append, don’t modify
- For forked/sub-agents: use byte-identical prefixes to share the parent’s cache
- Track cache breaks — one accidental break costs the equivalent of 5+ turns of savings
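The 50-turn cost math from the opening paragraph, as back-of-envelope code. The price is illustrative (not a real rate card), and the one-time cache-write premium is ignored:

```python
# Back-of-envelope cache economics. Prices are illustrative only.

def session_cost(turns: int, prompt_tokens: int,
                 price_per_mtok: float = 3.0,
                 cache_discount: float = 0.9) -> tuple[float, float]:
    """Cost of re-sending the system prompt every turn.

    Returns (uncached_cost, cached_cost) in dollars. Turn 1 pays full
    price; turns 2..N pay the discounted cached-read rate. The cache
    write premium on turn 1 is ignored for simplicity.
    """
    full = turns * prompt_tokens * price_per_mtok / 1_000_000
    cached_tokens = prompt_tokens + (turns - 1) * prompt_tokens * (1 - cache_discount)
    cached = cached_tokens * price_per_mtok / 1_000_000
    return full, cached
```

For the article’s example (50 turns, 20K-token prompt), the uncached path pays for 1M prompt tokens while the cached path pays the equivalent of roughly 118K — about an 8x difference on the system prompt alone.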
6. Intelligent Context Management
Claude Code never hits the API’s hard token limit because it compacts proactively using three strategies in order of cost.
Strategy 1: Microcompact — no API call required. Old tool results past a time threshold are replaced with [Old tool result cleared]. Cheap and fast, handles the common case.
Strategy 2: Proactive Compact — sends the full conversation to Claude for summarization. The summary prompt asks for: primary request and intent, key technical concepts, files and code sections with snippets, errors and fixes, all user messages verbatim, and pending tasks.
After compaction, the system doesn’t just resume — it reconstructs lost context:
- Re-reads recently accessed files
- Re-injects the active plan
- Re-injects the active skill
- Re-announces deferred tool schemas
- Re-runs session start hooks
Strategy 3: Emergency Truncation — triggered when the API itself returns a “prompt too long” error. Drops oldest message groups (not individual messages) to recover the exact gap. Retries up to 3 times. Last resort: truncate oldest 20% of groups.
Post-compaction, over 10 caches are invalidated: microcompact state, context collapse state, memoized CLAUDE.md, memory files cache, system prompt sections, classifier approvals, speculative pre-fetch results, and more. Missing even one of these produces subtle bugs — stale permissions, wrong file contents, outdated tool schemas.
What to steal:
- Implement three tiers of compaction: cheap (edit in place) → medium (API summarization) → expensive (truncation)
- Never hit the hard API limit — compact proactively at ~80% of the context window
- After compaction, re-inject lost context — don’t just summarize, rebuild the working state
- Invalidate all caches after compaction — this is the source of hard-to-reproduce bugs
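Strategy 1 is simple enough to sketch directly. The message shape and the keep-last threshold are assumptions; the important detail is editing in place rather than deleting, which preserves message count and order:

```python
# Sketch of the cheapest compaction tier: clear old tool results in
# place, no API call. Message format is an assumed role/content dict.

CLEARED = "[Old tool result cleared]"

def microcompact(messages: list[dict], keep_last: int = 3) -> list[dict]:
    """Replace all but the last `keep_last` tool results with a stub.

    Edits messages in place (same count, same order) rather than
    deleting them, so any prompt-cache prefix survives up to the
    first edited message.
    """
    tool_indices = [i for i, m in enumerate(messages) if m["role"] == "tool"]
    stale = tool_indices[:max(0, len(tool_indices) - keep_last)]
    for i in stale:
        messages[i] = {**messages[i], "content": CLEARED}
    return messages
```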
7. Memory As a Separate Agent
Instead of a vector database, Claude Code uses a file system with a dedicated extraction agent. After the main agent finishes a turn, a forked agent spawns with restricted tools (Read, Write, Edit — only to the memory directory; no Bash, no Agent, no MCP). It has a 5-turn maximum to prevent rabbit-holing. It advances a cursor to track what it has already processed.
Retrieval at query time works differently from similarity search. All memory file frontmatter is scanned, sent to a cheap fast model (Sonnet or Haiku), which picks up to 5 relevant files. Those files are attached as context to the user’s message.
Memory is organized into four typed categories:
| Type | What it stores | Purpose |
|---|---|---|
user | Role, expertise, preferences | Tailor future responses to this person |
feedback | Corrections and confirmed approaches | Avoid repeating mistakes; continue what worked |
project | Goals, decisions, deadlines, constraints | Understand why the work matters |
reference | Pointers to external systems | Reduce “where is X?” questions |
They also explicitly define what NOT to save: code patterns (derivable from code), git history (derivable from git log), fix recipes (the fix is in the code), anything already in CLAUDE.md, and ephemeral task state (use tasks, not memory). This prevents bloat that would degrade retrieval quality over time.
Mutual exclusion prevents duplicates: if the main agent wrote memories during a turn, auto-extraction skips that turn.
What to steal:
- Use a separate agent for memory extraction — restricted tools and a turn limit prevent it from becoming a side project
- Type your memories — types enable smarter retrieval than similarity alone
- Use a cheap model for retrieval (Haiku picks candidates, Opus processes the query)
- Frontmatter enables structured filtering without reading full file contents
- Define explicit “what NOT to save” rules — omission is as important as inclusion
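The frontmatter-only retrieval step can be sketched as follows. The YAML-ish `key: value` frontmatter format between `---` fences is an assumption about the file layout:

```python
# Sketch of frontmatter-only retrieval: scan headers, never full files.
# The frontmatter format (key: value lines between --- fences) is assumed.

def parse_frontmatter(text: str) -> dict:
    """Extract key: value pairs between leading '---' fences."""
    meta = {}
    lines = text.splitlines()
    if lines and lines[0] == "---":
        for line in lines[1:]:
            if line == "---":
                break
            key, _, value = line.partition(":")
            meta[key.strip()] = value.strip()
    return meta

def candidates(files: dict[str, str], memory_type: str) -> list[str]:
    """Filter memory files by typed category using frontmatter alone.

    In the real pipeline, this filtered candidate list (not the file
    bodies) is what gets handed to a cheap model to pick up to 5 files.
    """
    return [path for path, text in files.items()
            if parse_frontmatter(text).get("type") == memory_type]
```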
8. Principle-Based Safety
Rule lists fail on unseen inputs. “Don’t delete files” doesn’t cover shred, truncate, or dd if=/dev/zero. Claude Code uses principles instead of rules, with rules as examples of the principles.
The core principle: consider reversibility and blast radius. Local, reversible actions proceed freely. Hard-to-reverse or shared-state actions get a confirmation step. The cost of pausing is low. The cost of an unwanted action is high.
This generalizes naturally. A new command the prompt never mentioned — shred, for instance — gets evaluated against the principle: is it reversible? What’s the blast radius? The model can reason correctly about tools that don’t exist yet.
CRITICAL/IMPORTANT/normal emphasis levels are used deliberately, not liberally. Overusing CRITICAL trains the model to treat everything as equally urgent, which defeats the purpose.
What to steal:
- Lead with principles (“consider reversibility”), follow with examples of the principle
- Use three emphasis levels sparingly — their power comes from scarcity
- Include anti-patterns (“when NOT to do X”) alongside rules
- Include the WHY behind every rule so the model can judge edge cases
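On the system side, the principle still needs a mechanical backstop (see the dual-level enforcement in section 4). A crude sketch, with entirely invented heuristics — the model reasons from the principle, while code like this catches the obvious cases regardless of what the model decides:

```python
# Crude mechanical backstop for the reversibility/blast-radius principle.
# The hint lists are invented examples, not an exhaustive blacklist --
# the principle-based prompt is what covers commands these lists miss.

IRREVERSIBLE_HINTS = ("rm -rf", "shred", "dd if=", "push --force", "truncate")
SHARED_STATE_HINTS = ("git push", "kubectl", "DROP TABLE")

def needs_confirmation(command: str) -> bool:
    """Pause for hard-to-reverse or shared-state actions.

    The cost of pausing is low; the cost of an unwanted action is high.
    """
    irreversible = any(h in command for h in IRREVERSIBLE_HINTS)
    shared_state = any(h in command for h in SHARED_STATE_HINTS)
    return irreversible or shared_state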
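On the system side, the principle still needs a mechanical backstop (see the dual-level enforcement in section 4). A crude sketch, with entirely invented heuristics — the model reasons from the principle, while code like this catches the obvious cases regardless of what the model decides:

```python
# Crude mechanical backstop for the reversibility/blast-radius principle.
# The hint lists are invented examples, not an exhaustive blacklist --
# the principle-based prompt is what covers commands these lists miss.

IRREVERSIBLE_HINTS = ("rm -rf", "shred", "dd if=", "push --force", "truncate")
SHARED_STATE_HINTS = ("git push", "kubectl", "DROP TABLE")

def needs_confirmation(command: str) -> bool:
    """Pause for hard-to-reverse or shared-state actions.

    The cost of pausing is low; the cost of an unwanted action is high.
    """
    irreversible = any(h in command for h in IRREVERSIBLE_HINTS)
    shared_state = any(h in command for h in SHARED_STATE_HINTS)
    return irreversible or shared_state
```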
9. Deferred Tool Loading
Thirty-plus core tools plus fifty-plus MCP tools equals roughly 100K tokens of tool schemas if loaded all at once. Claude Code defers tools that aren’t needed immediately.
A session starts with approximately 15 core tools loaded with full schemas: Bash, Read, Write, Edit, Glob, Grep, Agent, and a few others. The remaining 30+ tools are listed by name only — no schema, minimal token cost. When the model needs a deferred tool, it calls a meta-tool (ToolSearch) which loads the full schema on demand.
This scales to 100+ tools without context bloat. It also means MCP tools from rarely-used servers don’t eat context on every turn of a session that never touches them.
What to steal:
- If your agent has more than 15 tools, load the 10–15 most common with full schemas
- List remaining tools by name only
- Provide a “discover_tool” meta-tool that loads full schemas on demand
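A sketch of the pattern, assuming a two-tool registry for brevity. The `discover_tool` meta-tool name comes from the bullet above; all tool names and schemas here are placeholders:

```python
# Sketch of deferred tool loading: full schemas for core tools,
# name-only stubs plus a discover_tool meta-tool for the rest.

FULL_SCHEMAS = {
    "Read": {"name": "Read", "input_schema": {"type": "object"}},
    "NotebookEdit": {"name": "NotebookEdit", "input_schema": {"type": "object"}},
}
CORE = {"Read"}   # assumed "most common" subset, loaded up front

def initial_tool_list() -> list[dict]:
    """Tools sent on turn 1: core schemas plus a discovery meta-tool."""
    tools = [FULL_SCHEMAS[name] for name in sorted(CORE)]
    deferred = sorted(set(FULL_SCHEMAS) - CORE)
    tools.append({
        "name": "discover_tool",
        "description": ("Load the full schema of a deferred tool. "
                        "Available: " + ", ".join(deferred)),
        "input_schema": {"type": "object",
                         "properties": {"name": {"type": "string"}}},
    })
    return tools

def discover_tool(name: str) -> dict:
    """Called by the model on demand; result joins the next request."""
    return FULL_SCHEMAS[name]
```

The deferred tools cost only their names in the description string, not their full schemas, which is where the token savings come from.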
10. The “Information Will Disappear” Pattern
One small prompt instruction with outsized impact:
“When working with tool results, write down any important information you might need later in your response, as the original tool result may be cleared later.”
This turns a limitation (context compaction clears tool results) into a deliberate behavior. The model becomes its own note-taker:
- Reads a file → writes down the key lines in its response text
- Runs a command → summarizes the output before continuing
- Searches code → extracts the relevant paths and functions
Post-compaction, the model’s own notes survive in the summary. The information was “saved” by the model itself, not by any infrastructure. This costs nothing and requires no tooling changes.
What to steal: add this exact pattern to your agent prompt. Simple, effective, and makes the model self-documenting.
Ranked by Impact
| Rank | Pattern | Effort | Impact |
|---|---|---|---|
| 1 | Meta-prompt your tools (WHAT + WHEN NOT + WHY) | Easy | High |
| 2 | Stream + parallel tool execution | Hard | High |
| 3 | Modular prompt sections (static first, dynamic last) | Easy | High |
| 4 | Three-tier compaction (microcompact → summarize → truncate) | Medium | High |
| 5 | Mechanical safety layer (validate before execute) | Medium | High |
| 6 | “Information will disappear” prompt | Easy | Medium |
| 7 | Typed memory system (user/feedback/project/reference) | Medium | Medium |
| 8 | Separate memory extraction agent (restricted tools, turn limit) | Medium | Medium |
| 9 | Deferred tool loading (name-only + on-demand schema) | Easy | Medium |
| 10 | Principle-based safety (“consider reversibility”) | Easy | Medium |
The Real Moat
None of these patterns works in isolation. The prompt cache strategy shapes the prompt structure. The prompt structure shapes how tool descriptions are written. The tool descriptions shape what the permission system needs to enforce. The permission system shapes how the memory extraction agent is scoped. The memory design shapes what context management needs to preserve.
Each design decision reinforces the others. That’s the moat — not any individual feature, but the coherence between all of them.
The single biggest lesson: Claude Code treats prompt engineering as a first-class engineering discipline — versioned, measured, A/B tested, and architected with the same rigor as the runtime code. The gap between that approach and treating prompts as config strings is where most of the performance difference lives.