Claude Code’s Design Philosophy: 10 Patterns to Steal for Your Agent Systems
31st March 2026
A deep dive into Claude Code’s engineering decisions — the prompt architecture, tool philosophy, concurrency model, permission system, and memory design that make it work. Each section includes what you can apply to your own agent systems.
1. The Prompt Is The Product
Most agent builders treat prompts as an afterthought — write the tools and code first, then add a system prompt at the end. Claude Code inverts this: the prompt is the primary artifact, and everything else is built around it.
The system prompt is structured into independently iterable, A/B testable sections:
┌────────────────────────────────────────────────────┐
│ getSimpleIntroSection()      ← Identity            │
│ getSimpleSystemSection()     ← Mechanics           │
│ getSimpleDoingTasksSection() ← Philosophy          │
│ getActionsSection()          ← Ethics              │
│ getUsingYourToolsSection()   ← Judgment            │
│ getOutputEfficiencySection() ← Style               │
│ getToneAndStyleSection()     ← Voice               │
│                                                    │
│ ── DYNAMIC_BOUNDARY ──────── ← Cache break point   │
│                                                    │
│ getMemorySection()           ← Per-project context │
│ getEnvironmentSection()      ← Per-session state   │
└────────────────────────────────────────────────────┘
Everything above the boundary is static — same for all users, all sessions. It gets cached globally and the cache is shared across users. Everything below is dynamic per user or session and cannot be cached.
Two design details worth noting: @[MODEL LAUNCH] markers allow tuning per model generation without touching the rest of the prompt. Quantified anchors replace vague adjectives — “keep text between tool calls to ≤25 words” instead of “be concise.”
What to steal for your agent systems:
- Split your prompt into named sections — you can’t A/B test what you can’t isolate
- Put cacheable content first, dynamic content last
- Use numbers not adjectives (“max 25 words” not “be brief”)
- Version sections with model-generation tags so you can tune per model
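A minimal sketch of what this structure can look like in code. The section contents are invented placeholders; the `cache_control` marker matches the Anthropic Messages API convention for cache breakpoints, but treat the exact shape as an assumption:

```python
# Hypothetical sketch of a sectioned system prompt with a cache boundary.
# Section names mirror the diagram above; the text inside is invented.

STATIC_SECTIONS = [
    lambda: "You are a coding agent.",                       # identity
    lambda: "Keep text between tool calls to <=25 words.",   # quantified anchor
]

def build_prompt(memory: str, environment: str) -> list[dict]:
    """Return prompt blocks: static first (cacheable), dynamic last."""
    static = "\n\n".join(section() for section in STATIC_SECTIONS)
    return [
        # Everything up to and including this block can be cached and
        # shared, because it is byte-identical for every user and session.
        {"text": static, "cache_control": {"type": "ephemeral"}},
        {"text": memory},        # per-project context, not cached
        {"text": environment},   # per-session state, not cached
    ]
```

Because each section is a named callable, a single section can be swapped out for an A/B test without touching the rest of the prompt.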
2. Meta-Prompting — Teaching Judgment, Not Just API
A standard tool description tells the model what a tool does. Claude Code’s tool descriptions do three things:
- WHAT it does — one line
- WHEN to use it and when NOT to — decision logic with named alternatives
- HOW to use it well — anti-patterns, safety rails, concrete examples
Plus a WHY — the reason behind the rule, so the model can generalize to novel situations.
For example, the Bash tool description doesn’t just say “runs shell commands.” It says: use Grep instead of running rg via Bash because the user gets a better review experience with dedicated tools. The model now knows the principle, not just the rule. It can apply that principle to tools and situations the prompt never explicitly covered.
This is why Claude Code picks the right tool at a high rate. Most agents pick tools by keyword matching because their tool descriptions only answer “what” — not “when” or “why.”
What to steal:
- Add a “WHEN NOT TO USE” section to every tool description
- Add “PREFER X OVER Y” routing rules for overlapping tools
- Include the WHY so the model can generalize to new situations
- Put decision logic before parameter documentation
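Here is what such a description can look like as a concrete tool definition. The wording is invented, but the WHAT / WHEN NOT / HOW / WHY structure and the Grep-over-Bash routing rule follow the section above:

```python
# Hypothetical tool definition following the WHAT / WHEN NOT / HOW / WHY
# structure. Decision logic comes before parameter documentation.
GREP_TOOL = {
    "name": "Grep",
    "description": (
        "WHAT: Searches file contents with regular expressions.\n"
        "WHEN NOT TO USE: Do not run `rg` or `grep` through Bash; "
        "PREFER Grep OVER Bash for content search.\n"
        "HOW: Scope searches with a glob filter; avoid unanchored .* patterns.\n"
        "WHY: Dedicated tools give the user a structured, reviewable result "
        "that raw shell output does not."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "pattern": {"type": "string"},
            "glob": {"type": "string"},
        },
        "required": ["pattern"],
    },
}
```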
3. Generator-Based Streaming Architecture
Most agents wait for the model to finish streaming, then execute tools, then send results back. Claude Code starts executing tools while the model is still streaming.
Standard approach:
request → wait → response → execute tools → send results
Claude Code approach:
request → stream → parse tool_use block #1 → START executing tool #1
→ parse tool_use block #2 → START executing tool #2
(parallel if read-only)
→ model finishes streaming
→ tool #1 already done
→ tool #2 finishing...
Tools are categorized by concurrency safety. Read-only tools (Glob, Grep, Read) run in parallel, up to 10 at once. Write tools run sequentially to avoid race conditions. If a Bash tool fails, sibling tools are aborted.
The practical impact: read-heavy turns (exploring a codebase, reading multiple files) finish significantly faster because file reads that would have been sequential now run in parallel during the same streaming window.
What to steal:
- Parse tool calls from streaming chunks — don’t wait for the full response
- Categorize tools as read-only vs write before execution
- Run read-only tools in parallel (the latency win is significant)
- Run write tools sequentially (avoids race conditions)
- Abort sibling tools on critical failure
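The concurrency rules above can be sketched with `asyncio`: a semaphore caps parallel read-only tools at 10, and a lock serializes writes. The tool names and the stubbed `run_tool` are placeholders; real implementations would also wire in the abort-on-failure path:

```python
import asyncio

READ_ONLY = {"Read", "Glob", "Grep"}   # assumed read-only categorization

async def run_tool(name: str, args: dict) -> str:
    await asyncio.sleep(0.01)          # stand-in for real tool work
    return f"{name} done"

async def execute_streamed(tool_calls):
    """Start each tool as soon as it is parsed from the stream.

    `tool_calls` is an async iterator yielding (name, args) pairs as
    tool_use blocks are parsed from streaming chunks.
    """
    read_slots = asyncio.Semaphore(10)   # up to 10 parallel reads
    write_lock = asyncio.Lock()          # writes run one at a time

    async def guarded(name, args):
        if name in READ_ONLY:
            async with read_slots:
                return await run_tool(name, args)
        async with write_lock:
            return await run_tool(name, args)

    tasks = []
    async for name, args in tool_calls:
        # Execution begins here, while the model is still streaming.
        tasks.append(asyncio.create_task(guarded(name, args)))
    return await asyncio.gather(*tasks)
```

The key point is that `create_task` fires inside the streaming loop, so by the time the model finishes its turn, read-only tools are typically already done.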
4. Five-Layer Permission System
Claude Code uses five independent layers to decide whether a tool call can proceed. Any one layer can block the operation. No layer trusts any other.
| Layer | Scope | What it does |
|---|---|---|
| Input Validation | Per-tool, static | Schema check, path traversal prevention |
| Mode Policy | Session-scoped | Plan mode blocks all writes; auto mode defers to classifier |
| Rule Matching | Persistent whitelist | User-configured patterns like Bash(npm run:*) |
| Hook Evaluation | Extensible, async | PreToolUse hooks with custom logic; can modify inputs |
| Human Review | Multi-channel racing | Terminal UI, IDE bridge, mobile app, classifier — first responder wins |
The racing pattern at Layer 5 is particularly interesting: six sources race concurrently for permission — terminal UI, IDE bridge, mobile channel, hooks, classifier, and a coordinator. The first to claim the decision wins atomically. This means a developer can approve from their phone while the terminal is waiting, and it works correctly without any race condition.
Critically, safety rules are enforced at two levels simultaneously. The prompt says “never force push to main.” The permission system independently blocks git push --force on protected branches. The model cannot override the mechanical check by reasoning its way around the prompt instruction.
What to steal:
- Validate tool inputs mechanically — don’t rely on the model to self-police
- Categorize tools by risk: read / write / destructive
- Auto-approve reads, prompt for writes, hard-block dangerous operations
- Make permission rules persistent and user-configurable
- Keep “what the model wants” separate from “what the system allows”
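A toy version of the layered check, covering the first three layers of the table. Every check here is invented for illustration; the point is the shape — independent predicates, all of which must pass, none of which trusts another:

```python
# Sketch of independent permission layers; any single layer may block.
# Layer names follow the table above; the checks themselves are invented.

def input_validation(call: dict) -> bool:
    """Layer 1: mechanical schema/path checks, no model involved."""
    return ".." not in call.get("path", "")        # path traversal guard

def mode_policy(call: dict, mode: str) -> bool:
    """Layer 2: session mode policy (plan mode blocks all writes)."""
    return not (mode == "plan" and call["kind"] == "write")

def rule_matching(call: dict, allowed: tuple) -> bool:
    """Layer 3: persistent user-configured whitelist patterns."""
    return any(call["name"].startswith(p) for p in allowed)

def check_permission(call: dict, mode: str = "auto",
                     allowed: tuple = ("Read", "Grep")) -> bool:
    layers = [
        input_validation(call),
        mode_policy(call, mode),
        rule_matching(call, allowed),
    ]
    return all(layers)   # every layer must independently approve
```

In the real system, calls that fail the whitelist fall through to hooks and human review rather than being rejected outright; this sketch collapses that into a single boolean for brevity.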
5. Prompt Cache Economics
The cost math is stark. Without caching, a 50-turn session with a 20K-token system prompt wastes roughly 1 million input tokens. With proper caching structure, turns 2–50 hit the cache at a 90% discount.
Claude Code maximizes cache hits by obsessively controlling what changes between turns. The static section of the system prompt — identity, philosophy, tool descriptions, code quality rules — is identical for all users in all sessions. It gets cached at global scope, meaning the cache is shared across users, not just per-session.
Cache busting sources they track and avoid:
- New MCP tools connected
- GrowthBook feature flags refreshed
- Auto mode toggled
- Permission rules changed
Tool schemas are memoized per-session and survive GrowthBook refreshes. Forked agents share the parent’s prompt cache via byte-identical prefixes. The compact agent uses the same tracking key as the main thread. Microcompact sends “cache edits” instead of deleting messages — edits don’t break the cache, deletions do.
What to steal:
- Put all static content before all dynamic content in your system prompt
- Never mutate the static section between turns — append, don’t modify
- For forked/sub-agents: use byte-identical prefixes to share the parent’s cache
- Track cache breaks — one accidental break costs the equivalent of 5+ turns of savings
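The 50-turn cost math from the opening paragraph, as back-of-envelope code. The price is illustrative (not a real rate card), and the one-time cache-write premium is ignored:

```python
# Back-of-envelope cache economics. Prices are illustrative only.

def session_cost(turns: int, prompt_tokens: int,
                 price_per_mtok: float = 3.0,
                 cache_discount: float = 0.9) -> tuple[float, float]:
    """Cost of re-sending the system prompt every turn.

    Returns (uncached_cost, cached_cost) in dollars. Turn 1 pays full
    price; turns 2..N pay the discounted cached-read rate. The cache
    write premium on turn 1 is ignored for simplicity.
    """
    full = turns * prompt_tokens * price_per_mtok / 1_000_000
    cached_tokens = prompt_tokens + (turns - 1) * prompt_tokens * (1 - cache_discount)
    cached = cached_tokens * price_per_mtok / 1_000_000
    return full, cached
```

For the article’s example (50 turns, 20K-token prompt), the uncached path pays for 1M prompt tokens while the cached path pays the equivalent of roughly 118K — about an 8x difference on the system prompt alone.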
6. Intelligent Context Management
Claude Code never hits the API’s hard token limit because it compacts proactively using three strategies in order of cost.
Strategy 1: Microcompact — no API call required. Old tool results past a time threshold are replaced with [Old tool result cleared]. Cheap and fast, handles the common case.
Strategy 2: Proactive Compact — sends the full conversation to Claude for summarization. The summary prompt asks for: primary request and intent, key technical concepts, files and code sections with snippets, errors and fixes, all user messages verbatim, and pending tasks.
After compaction, the system doesn’t just resume — it reconstructs lost context:
- Re-reads recently accessed files
- Re-injects the active plan
- Re-injects the active skill
- Re-announces deferred tool schemas
- Re-runs session start hooks
Strategy 3: Emergency Truncation — triggered when the API itself returns a “prompt too long” error. Drops oldest message groups (not individual messages) to recover the exact gap. Retries up to 3 times. Last resort: truncate oldest 20% of groups.
Post-compaction, over 10 caches are invalidated: microcompact state, context collapse state, memoized CLAUDE.md, memory files cache, system prompt sections, classifier approvals, speculative pre-fetch results, and more. Missing even one of these produces subtle bugs — stale permissions, wrong file contents, outdated tool schemas.
What to steal:
- Implement three tiers of compaction: cheap (edit in place) → medium (API summarization) → expensive (truncation)
- Never hit the hard API limit — compact proactively at ~80% of the context window
- After compaction, re-inject lost context — don’t just summarize, rebuild the working state
- Invalidate all caches after compaction — this is the source of hard-to-reproduce bugs
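Strategy 1 is simple enough to sketch directly. The message shape and the keep-last threshold are assumptions; the important detail is editing in place rather than deleting, which preserves message count and order:

```python
# Sketch of the cheapest compaction tier: clear old tool results in
# place, no API call. Message format is an assumed role/content dict.

CLEARED = "[Old tool result cleared]"

def microcompact(messages: list[dict], keep_last: int = 3) -> list[dict]:
    """Replace all but the last `keep_last` tool results with a stub.

    Edits messages in place (same count, same order) rather than
    deleting them, so any prompt-cache prefix survives up to the
    first edited message.
    """
    tool_indices = [i for i, m in enumerate(messages) if m["role"] == "tool"]
    stale = tool_indices[:max(0, len(tool_indices) - keep_last)]
    for i in stale:
        messages[i] = {**messages[i], "content": CLEARED}
    return messages
```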
7. Memory As a Separate Agent
Instead of a vector database, Claude Code uses a file system with a dedicated extraction agent. After the main agent finishes a turn, a forked agent spawns with restricted tools (Read, Write, Edit — only to the memory directory; no Bash, no Agent, no MCP). It has a 5-turn maximum to prevent rabbit-holing. It advances a cursor to track what it has already processed.
Retrieval at query time works differently from similarity search. All memory file frontmatter is scanned, sent to a cheap fast model (Sonnet or Haiku), which picks up to 5 relevant files. Those files are attached as context to the user’s message.
Memory is organized into four typed categories:
| Type | What it stores | Purpose |
|---|---|---|
user | Role, expertise, preferences | Tailor future responses to this person |
feedback | Corrections and confirmed approaches | Avoid repeating mistakes; continue what worked |
project | Goals, decisions, deadlines, constraints | Understand why the work matters |
reference | Pointers to external systems | Reduce “where is X?” questions |
They also explicitly define what NOT to save: code patterns (derivable from code), git history (derivable from git log), fix recipes (the fix is in the code), anything already in CLAUDE.md, and ephemeral task state (use tasks, not memory). This prevents bloat that would degrade retrieval quality over time.
Mutual exclusion prevents duplicates: if the main agent wrote memories during a turn, auto-extraction skips that turn.
What to steal:
- Use a separate agent for memory extraction — restricted tools and a turn limit prevent it from becoming a side project
- Type your memories — types enable smarter retrieval than similarity alone
- Use a cheap model for retrieval (Haiku picks candidates, Opus processes the query)
- Frontmatter enables structured filtering without reading full file contents
- Define explicit “what NOT to save” rules — omission is as important as inclusion
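The frontmatter-only retrieval step can be sketched as follows. The YAML-ish `key: value` frontmatter format between `---` fences is an assumption about the file layout:

```python
# Sketch of frontmatter-only retrieval: scan headers, never full files.
# The frontmatter format (key: value lines between --- fences) is assumed.

def parse_frontmatter(text: str) -> dict:
    """Extract key: value pairs between leading '---' fences."""
    meta = {}
    lines = text.splitlines()
    if lines and lines[0] == "---":
        for line in lines[1:]:
            if line == "---":
                break
            key, _, value = line.partition(":")
            meta[key.strip()] = value.strip()
    return meta

def candidates(files: dict[str, str], memory_type: str) -> list[str]:
    """Filter memory files by typed category using frontmatter alone.

    In the real pipeline, this filtered candidate list (not the file
    bodies) is what gets handed to a cheap model to pick up to 5 files.
    """
    return [path for path, text in files.items()
            if parse_frontmatter(text).get("type") == memory_type]
```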
8. Principle-Based Safety
Rule lists fail on unseen inputs. “Don’t delete files” doesn’t cover shred, truncate, or dd if=/dev/zero. Claude Code uses principles instead of rules, with rules as examples of the principles.
The core principle: consider reversibility and blast radius. Local, reversible actions proceed freely. Hard-to-reverse or shared-state actions get a confirmation step. The cost of pausing is low. The cost of an unwanted action is high.
This generalizes naturally. A new command the prompt never mentioned — shred, for instance — gets evaluated against the principle: is it reversible? What’s the blast radius? The model can reason correctly about tools that don’t exist yet.
CRITICAL/IMPORTANT/normal emphasis levels are used deliberately, not liberally. Overusing CRITICAL trains the model to treat everything as equally urgent, which defeats the purpose.
What to steal:
- Lead with principles (“consider reversibility”), follow with examples of the principle
- Use three emphasis levels sparingly — their power comes from scarcity
- Include anti-patterns (“when NOT to do X”) alongside rules
- Include the WHY behind every rule so the model can judge edge cases
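On the system side, the principle still needs a mechanical backstop (see the dual-level enforcement in section 4). A crude sketch, with entirely invented heuristics — the model reasons from the principle, while code like this catches the obvious cases regardless of what the model decides:

```python
# Crude mechanical backstop for the reversibility/blast-radius principle.
# The hint lists are invented examples, not an exhaustive blacklist --
# the principle-based prompt is what covers commands these lists miss.

IRREVERSIBLE_HINTS = ("rm -rf", "shred", "dd if=", "push --force", "truncate")
SHARED_STATE_HINTS = ("git push", "kubectl", "DROP TABLE")

def needs_confirmation(command: str) -> bool:
    """Pause for hard-to-reverse or shared-state actions.

    The cost of pausing is low; the cost of an unwanted action is high.
    """
    irreversible = any(h in command for h in IRREVERSIBLE_HINTS)
    shared_state = any(h in command for h in SHARED_STATE_HINTS)
    return irreversible or shared_state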
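On the system side, the principle still needs a mechanical backstop (see the dual-level enforcement in section 4). A crude sketch, with entirely invented heuristics — the model reasons from the principle, while code like this catches the obvious cases regardless of what the model decides:

```python
# Crude mechanical backstop for the reversibility/blast-radius principle.
# The hint lists are invented examples, not an exhaustive blacklist --
# the principle-based prompt is what covers commands these lists miss.

IRREVERSIBLE_HINTS = ("rm -rf", "shred", "dd if=", "push --force", "truncate")
SHARED_STATE_HINTS = ("git push", "kubectl", "DROP TABLE")

def needs_confirmation(command: str) -> bool:
    """Pause for hard-to-reverse or shared-state actions.

    The cost of pausing is low; the cost of an unwanted action is high.
    """
    irreversible = any(h in command for h in IRREVERSIBLE_HINTS)
    shared_state = any(h in command for h in SHARED_STATE_HINTS)
    return irreversible or shared_state
```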
9. Deferred Tool Loading
Thirty-plus core tools plus fifty-plus MCP tools equals roughly 100K tokens of tool schemas if loaded all at once. Claude Code defers tools that aren’t needed immediately.
A session starts with approximately 15 core tools loaded with full schemas: Bash, Read, Write, Edit, Glob, Grep, Agent, and a few others. The remaining 30+ tools are listed by name only — no schema, minimal token cost. When the model needs a deferred tool, it calls a meta-tool (ToolSearch) which loads the full schema on demand.
This scales to 100+ tools without context bloat. It also means MCP tools from rarely-used servers don’t eat context on every turn of a session that never touches them.
What to steal:
- If your agent has more than 15 tools, load the 10–15 most common with full schemas
- List remaining tools by name only
- Provide a “discover_tool” meta-tool that loads full schemas on demand
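A sketch of the pattern, assuming a two-tool registry for brevity. The `discover_tool` meta-tool name comes from the bullet above; all tool names and schemas here are placeholders:

```python
# Sketch of deferred tool loading: full schemas for core tools,
# name-only stubs plus a discover_tool meta-tool for the rest.

FULL_SCHEMAS = {
    "Read": {"name": "Read", "input_schema": {"type": "object"}},
    "NotebookEdit": {"name": "NotebookEdit", "input_schema": {"type": "object"}},
}
CORE = {"Read"}   # assumed "most common" subset, loaded up front

def initial_tool_list() -> list[dict]:
    """Tools sent on turn 1: core schemas plus a discovery meta-tool."""
    tools = [FULL_SCHEMAS[name] for name in sorted(CORE)]
    deferred = sorted(set(FULL_SCHEMAS) - CORE)
    tools.append({
        "name": "discover_tool",
        "description": ("Load the full schema of a deferred tool. "
                        "Available: " + ", ".join(deferred)),
        "input_schema": {"type": "object",
                         "properties": {"name": {"type": "string"}}},
    })
    return tools

def discover_tool(name: str) -> dict:
    """Called by the model on demand; result joins the next request."""
    return FULL_SCHEMAS[name]
```

The deferred tools cost only their names in the description string, not their full schemas, which is where the token savings come from.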
10. The “Information Will Disappear” Pattern
One small prompt instruction with outsized impact:
“When working with tool results, write down any important information you might need later in your response, as the original tool result may be cleared later.”
This turns a limitation (context compaction clears tool results) into a deliberate behavior. The model becomes its own note-taker:
- Reads a file → writes down the key lines in its response text
- Runs a command → summarizes the output before continuing
- Searches code → extracts the relevant paths and functions
Post-compaction, the model’s own notes survive in the summary. The information was “saved” by the model itself, not by any infrastructure. This costs nothing and requires no tooling changes.
What to steal: add this exact pattern to your agent prompt. Simple, effective, and makes the model self-documenting.
Ranked by Impact
| Rank | Pattern | Effort | Impact |
|---|---|---|---|
| 1 | Meta-prompt your tools (WHAT + WHEN NOT + WHY) | Easy | High |
| 2 | Stream + parallel tool execution | Hard | High |
| 3 | Modular prompt sections (static first, dynamic last) | Easy | High |
| 4 | Three-tier compaction (microcompact → summarize → truncate) | Medium | High |
| 5 | Mechanical safety layer (validate before execute) | Medium | High |
| 6 | “Information will disappear” prompt | Easy | Medium |
| 7 | Typed memory system (user/feedback/project/reference) | Medium | Medium |
| 8 | Separate memory extraction agent (restricted tools, turn limit) | Medium | Medium |
| 9 | Deferred tool loading (name-only + on-demand schema) | Easy | Medium |
| 10 | Principle-based safety (“consider reversibility”) | Easy | Medium |
The Real Moat
None of these patterns works in isolation. The prompt cache strategy shapes the prompt structure. The prompt structure shapes how tool descriptions are written. The tool descriptions shape what the permission system needs to enforce. The permission system shapes how the memory extraction agent is scoped. The memory design shapes what context management needs to preserve.
Each design decision reinforces the others. That’s the moat — not any individual feature, but the coherence between all of them.
The single biggest lesson: Claude Code treats prompt engineering as a first-class engineering discipline — versioned, measured, A/B tested, and architected with the same rigor as the runtime code. The gap between that approach and treating prompts as config strings is where most of the performance difference lives.