Does Claude Code Test Itself? Yes — Here’s What’s Actually in the Source
31st March 2026
Anthropic published a blog post on demystifying evals for AI agents. It recommends three grader types, eight setup steps, and a feedback loop from production back into improvement decisions. What makes this interesting is what the Claude Code source code reveals: the product doesn’t just follow the philosophy — it IS the eval system.
The Eval Framework
The blog organizes graders into three types:
| Type | Methods | Characteristics |
|---|---|---|
| Code-based | String match, test pass/fail, outcome verification, tool call verification | Fast, cheap, deterministic |
| Model-based | Rubric scoring, natural language assertions, pairwise comparison, multi-judge consensus | Flexible, scales to complex behaviors |
| Human | SME review, crowdsourcing, spot-checks, A/B testing | Gold standard — but expensive |
Two distinct purposes for eval suites:
| Type | Goal | Target pass rate |
|---|---|---|
| Capability evals | What can it do? Hill-climb target. | Start low — room to improve |
| Regression evals | Does it still work? Safety net. | ~100% — any drop is a signal |
Two metrics with a subtle but important difference:
- pass@k — at least 1 of k trials succeeds. Optimistic. Good for capability measurement.
- pass^k — ALL k trials succeed. Pessimistic. Correct for production reliability. A 75% per-trial rate across 3 trials gives (0.75)³ ≈ 42% pass^k. That means a user asking the same question three times would see all three succeed less than half the time.
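Under the idealized assumption of a fixed, independent per-trial success rate p, both metrics are one-liners. A minimal sketch (function names are mine):

```typescript
// pass@k: probability that AT LEAST ONE of k independent trials succeeds.
function passAtK(p: number, k: number): number {
  return 1 - Math.pow(1 - p, k);
}

// pass^k: probability that ALL k independent trials succeed.
function passHatK(p: number, k: number): number {
  return Math.pow(p, k);
}

// The article's example: 75% per-trial success over 3 trials.
console.log(passAtK(0.75, 3));  // 0.984375 — looks great on a leaderboard
console.log(passHatK(0.75, 3)); // 0.421875 — the production-reliability view
```

The gap between the two numbers is the whole argument: the same agent scores 98% on an optimistic metric and 42% on a pessimistic one.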
The core grading principle: grade what the agent produced, not the path it took. Check whether tests pass, whether the file is correct, whether the outcome matches the spec. Don’t penalize creative but valid approaches.
The 8-Step Eval Roadmap
- Start early — 20–50 tasks drawn from real failures
- Convert manual tests to automated — remove human bottlenecks
- Write unambiguous tasks with reference solutions — ambiguity produces noisy scores
- Build balanced problem sets — positive and negative cases, edge cases
- Isolated, stable environments — clean state per trial, no cross-contamination
- Thoughtful graders — deterministic where possible, model-based where not
- Read transcripts — don’t trust scores blindly; graders can be wrong too
- Monitor saturation — 100% pass rate means no signal; replace with harder tasks
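The roadmap's first few steps can be collapsed into a tiny data structure: an unambiguous task, a reference solution, and a deterministic outcome grader. A minimal sketch — all field and function names here are illustrative, not from the Claude Code source:

```typescript
// One eval task: unambiguous prompt, reference solution, deterministic grader.
type EvalTask = {
  id: string;
  prompt: string;                      // unambiguous task description
  reference: string;                   // reference solution for review
  grade: (output: string) => boolean;  // code-based, outcome-only grader
};

const tasks: EvalTask[] = [
  {
    id: "regress-001", // converted from a real failure report
    prompt: "Return the sum of 2 and 2 as a bare number.",
    reference: "4",
    grade: (out) => out.trim() === "4", // grades the outcome, not the path
  },
];

// Pass rate over one trial per task. A regression suite targets ~100%;
// a capability suite deliberately starts low.
function passRate(run: (prompt: string) => string): number {
  const passed = tasks.filter((t) => t.grade(run(t.prompt))).length;
  return passed / tasks.length;
}
```

Note that the grader never inspects which tools or steps produced the output — the core grading principle from the blog, in code.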
What’s Actually in the Source Code
The Claude Code source (visible in the community-analyzed repository) implements a production observability and experimentation infrastructure that maps precisely to these recommendations.
1. Telemetry — 43+ tracked events
Every agent session emits structured telemetry covering four categories:
API RELIABILITY
├─ tengu_api_error → error type, status code, model
└─ tengu_model_fallback → original_model → fallback_model
TOOL EXECUTION
├─ tengu_tool_use_success → toolName, duration_ms
├─ tengu_tool_use_error → error, errorCode, toolName
└─ tengu_tool_use_* → 8 variants by approval source
PERMISSION FLOW
├─ granted_in_config → auto-approved by allowlist
├─ granted_by_classifier → ML-approved
├─ granted_by_hook → hook-approved
├─ granted_in_prompt_* → user approved (permanent/temp)
└─ rejected_in_prompt → user denied
SESSION HEALTH
├─ tengu_init / started / exit / cancel
├─ tengu_flicker → visual stability regression
├─ tengu_compact_failed → compaction failures
└─ tengu_uncaught_exception → unhandled errors
Every event is enriched with: model, platform, version, subscriptionType, userType, sessionId, messageId, requestId, and userBucket (1 of 30 hashed buckets for sampling).
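The userBucket enrichment is a standard stable-sampling trick: hash the user ID and reduce it modulo the bucket count, so the same user always lands in the same bucket. A sketch — the bucket count matches the article, but the hash choice and function name are my assumptions:

```typescript
import { createHash } from "node:crypto";

// Hash a user ID into one of 30 stable sampling buckets. Deterministic:
// the same user always falls in the same bucket across sessions.
function userBucket(userId: string, buckets = 30): number {
  const digest = createHash("sha256").update(userId).digest();
  // First 4 bytes as an unsigned int, reduced into the bucket range.
  return digest.readUInt32BE(0) % buckets;
}
```

Stable buckets let you sample telemetry (e.g. "only buckets 0–2 emit verbose events") without the noise of re-randomizing users on every session.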
2. A/B Testing — GrowthBook experiment infrastructure
The codebase contains a full experiment platform with user targeting attributes:
User attributes for targeting:
├─ id, sessionId, deviceID
├─ platform (win32 / darwin / linux)
├─ organizationUUID, accountUUID
├─ userType (ant vs external)
├─ subscriptionType (free / paid)
├─ rateLimitTier
├─ appVersion
└─ email, github metadata
When a user is assigned to an experiment, the exposure event captures the experimentId, the variantId, and the full set of user attributes at assignment time. Events flow to /api/event_logging/batch and then to BigQuery.
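Deterministic assignment is what makes this kind of A/B infrastructure work: hashing the user ID together with the experiment key gives a stable variant without storing per-user state. A GrowthBook-style sketch — all names and the hash choice are illustrative, not the real client:

```typescript
import { createHash } from "node:crypto";

// Stable variant assignment: hash (experiment, user) so the same user
// always sees the same variant, with no assignment table to maintain.
function assignVariant(userId: string, experimentId: string, variants: string[]): string {
  const h = createHash("sha256").update(`${experimentId}:${userId}`).digest();
  return variants[h.readUInt32BE(0) % variants.length];
}

// The exposure event records the assignment context at decision time,
// before it is batched off to the event-logging endpoint.
function exposureEvent(userId: string, experimentId: string, variants: string[]) {
  return {
    event: "experiment_exposure",
    experimentId,
    variantId: assignVariant(userId, experimentId, variants),
    // ...plus the full user attributes captured at assignment time
  };
}
```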
Three feature flag read patterns are used:
- CACHED_MAY_BE_STALE — non-blocking, safe to use at startup
- CACHED_OR_BLOCKING — for user-invoked features where freshness matters
- Env var overrides via CLAUDE_INTERNAL_FC_OVERRIDES — for eval harness use
GrowthBook refreshes every 6 hours for external users, every 20 minutes for internal Anthropic employees — who get new experiments first.
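The three read patterns can be sketched as one small client. This is my reconstruction of the shape, not the real API — only the pattern names and the CLAUDE_INTERNAL_FC_OVERRIDES variable come from the source:

```typescript
type FlagValue = boolean;

class FlagClient {
  private cache = new Map<string, FlagValue>();

  constructor(seed: Record<string, FlagValue>) {
    for (const [k, v] of Object.entries(seed)) this.cache.set(k, v);
  }

  // CACHED_MAY_BE_STALE: never blocks, so it is safe on the startup path.
  cachedMayBeStale(flag: string, fallback: FlagValue): FlagValue {
    return this.envOverride(flag) ?? this.cache.get(flag) ?? fallback;
  }

  // CACHED_OR_BLOCKING: refresh first, for user-invoked features where
  // a stale value would be visible.
  async cachedOrBlocking(flag: string, fallback: FlagValue): Promise<FlagValue> {
    await this.refresh();
    return this.envOverride(flag) ?? this.cache.get(flag) ?? fallback;
  }

  // Env override for eval harnesses, e.g.
  // CLAUDE_INTERNAL_FC_OVERRIDES='{"my_flag": true}'
  private envOverride(flag: string): FlagValue | undefined {
    const raw = process.env.CLAUDE_INTERNAL_FC_OVERRIDES;
    if (!raw) return undefined;
    try { return JSON.parse(raw)[flag]; } catch { return undefined; }
  }

  private async refresh(): Promise<void> {
    /* fetch fresh values from the flag service here */
  }
}
```

The override layer sitting above both cache paths is what lets an eval harness pin flags to known values, removing one source of run-to-run noise.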
3. OpenTelemetry Tracing — Full request lifecycle
Each agent turn generates a structured trace:
Turn Span (full turn duration)
├─ LLM Request Span
│ attrs: model, message_count, token counts
│
├─ Tool Execution Span
│ attrs: tool_name, duration_ms
│ │
│ ├─ User Blocking Span (if permission needed)
│ │ attrs: wait_duration_ms
│ │
│ └─ Tool Operation Span
│ attrs: result_size, error (if any)
│
└─ Hook Span (if hooks ran)
Traces export via OTLP (gRPC or HTTP) to the Anthropic backend, plus Perfetto traces for local Chrome DevTools debugging. Orphaned spans have a 30-minute TTL.
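The hierarchy above is just a tree of named spans with attributes. A minimal recorder mirroring that shape — this is not the real OTel integration, only an illustration of the turn → LLM/tool/hook nesting:

```typescript
// A span: name, attributes, and nested child spans.
type Span = {
  name: string;
  attrs: Record<string, string | number>;
  children: Span[];
};

function span(name: string, attrs: Record<string, string | number> = {}, children: Span[] = []): Span {
  return { name, attrs, children };
}

// One agent turn, matching the diagram (attribute values are made up).
const turn = span("turn", {}, [
  span("llm_request", { model: "claude-x", message_count: 4 }),
  span("tool_execution", { tool_name: "Bash", duration_ms: 120 }, [
    span("user_blocking", { wait_duration_ms: 3000 }),
    span("tool_operation", { result_size: 2048 }),
  ]),
  span("hook", {}),
]);

// Depth-first walk — the order an exporter would serialize spans in.
function flatten(s: Span): string[] {
  return [s.name, ...s.children.flatMap(flatten)];
}
```

Separating "user_blocking" from "tool_operation" is the key design choice: it keeps permission-prompt wait time from polluting tool latency metrics.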
4. Privacy-Safe Telemetry by Design
Analytics fields must pass through a marker type:
```typescript
type AnalyticsMetadata = {
  metadata_I_VERIFIED_THIS_IS_NOT_CODE_OR_FILEPATHS: string
}
```
A developer must attest in the type signature that a field doesn’t contain PII, code, or file paths. The compiler enforces this — you cannot accidentally log sensitive data. Additional safeguards: MCP tool names are sanitized, user IDs are hashed into 30 buckets, tool inputs are truncated to 512 characters with a 4KB JSON cap, and proto fields are stripped before Datadog dispatch.
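Here is how the marker type blocks accidental logging in practice. The type and the truncation limits (512-character inputs, 4 KB JSON cap) come from the article; the helper functions around them are my sketch:

```typescript
// Repeated from the article so this sketch is self-contained: the field
// name itself is the developer's attestation.
type AnalyticsMetadata = {
  metadata_I_VERIFIED_THIS_IS_NOT_CODE_OR_FILEPATHS: string;
};

// Illustrative sink: serialize and apply the 4 KB JSON cap.
function logEvent(name: string, meta: AnalyticsMetadata): string {
  const payload = JSON.stringify({ name, ...meta });
  return payload.slice(0, 4096);
}

// Illustrative helper: tool inputs capped at 512 characters.
function truncateToolInput(input: string): string {
  return input.slice(0, 512);
}

// Compiles — the attestation is explicit in the object literal:
logEvent("tengu_tool_use_success", {
  metadata_I_VERIFIED_THIS_IS_NOT_CODE_OR_FILEPATHS: "Bash",
});

// Does NOT compile — no attestation field, so the sink rejects it:
// logEvent("tengu_tool_use_success", { toolName: "Bash" });
```

The trick is that there is no runtime check at all: the unwieldy field name forces a human decision at every call site, and the compiler refuses anything else.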
5. 40+ Feature Flags
A representative sample of what’s gated:
| Flag | What it gates |
|---|---|
| tengu_concise_v2 | Output concision prompt changes |
| tengu_auto_mode_* | Classifier-based permission approval |
| tengu_amber_flint | Agent swarms / team mode |
| tengu_penguins_off | Fast mode killswitch |
| tengu_tool_pear | Strict tool use format |
| tengu_bramble_lintel | Memory extraction frequency |
| tengu_frond_boric | Analytics sink killswitches |
| TRANSCRIPT_CLASSIFIER | ML-based permission classification |
| BASH_CLASSIFIER | Bash command safety classification |
| CONTEXT_COLLAPSE | Context collapse feature |
| COORDINATOR_MODE | Multi-agent orchestration |
| DAEMON | Background daemon mode |
| AGENT_TRIGGERS | Scheduled agent triggers |
How the Blog Maps to the Code
| Blog recommendation | What Claude Code actually does |
|---|---|
| Start with manual tests from real failures | Started with Anthropic employee dogfooding, then formalized |
| Code-based graders: outcome verification | 43+ telemetry events — tool success/fail, token counts, cache hits |
| Model-based graders: rubric scoring | TRANSCRIPT_CLASSIFIER and BASH_CLASSIFIER for safety decisions |
| Human graders: gold standard | User approve/reject decisions with feedback flag; real A/B testing sessions |
| A/B testing with traffic | GrowthBook with 30-bucket user hashing and BigQuery pipeline |
| Production monitoring | Datadog (43 event types) + OpenTelemetry + Perfetto |
| Capability vs regression split | Feature flags gate new behaviors (capability); telemetry catches regressions in existing metrics |
| Grade outcomes, not paths | Tracks tool_use_success/error — not “did it use the right tool sequence” |
| Read transcripts | Sidechain transcripts per agent, session recording, resume system |
| Isolated environments | Git worktree isolation for agents, sandbox for bash, clean state per trial |
The Full Testing Feedback Loop
Putting it together, the cycle that runs continuously:
1. INFRASTRUCTURE
├─ Internal "ant" users get experiments first (20min refresh)
├─ Env var overrides for eval harnesses
└─ /config Gates tab for developer debugging
2. HYPOTHESIS
├─ Create GrowthBook experiment
├─ Gate prompt section with feature flag
└─ Roll out to 5% of internal users
3. MEASUREMENT (automated, continuous)
├─ Telemetry events → Datadog dashboards
├─ OTel traces → per-turn breakdown
└─ Control vs variant comparison:
- Output tokens per turn
- Tool success rate
- User cancellation rate
- Cache hit rate
- Session duration
4. DECISION
├─ Wins? → Roll to 100% external users
├─ Regresses? → Kill experiment
├─ Unclear? → Expand to 20%, gather more data
└─ Incident? → Killswitch fires immediately
5. REGRESSION GUARD
├─ Existing telemetry becomes regression baseline
├─ Cache break detection (12 checks)
├─ tengu_flicker detects visual stability regressions
└─ Model fallback tracking catches API reliability drops
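Step 5's regression guard reduces to a simple comparison: live metrics against the pre-experiment baseline, with a killswitch when the drop exceeds a threshold. A sketch — metric names, the threshold, and the function shape are all illustrative:

```typescript
type Metric = { baseline: number; current: number };

// Return the names of metrics whose relative drop from baseline exceeds
// maxDrop — each tripped metric would fire a killswitch (e.g. flipping a
// flag like tengu_penguins_off).
function shouldKill(metrics: Record<string, Metric>, maxDrop = 0.05): string[] {
  const tripped: string[] = [];
  for (const [name, m] of Object.entries(metrics)) {
    const drop = (m.baseline - m.current) / m.baseline;
    if (drop > maxDrop) tripped.push(name);
  }
  return tripped;
}
```

This is the sense in which "existing telemetry becomes regression baseline": yesterday's dashboards are today's test oracle, with no separate eval suite to maintain.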
What to Steal for Your Agent System
| Pattern | Effort | Impact |
|---|---|---|
| Instrument from day 1: tool success/fail, tokens, latency, user interrupts | Easy | High |
| Grade outcomes not paths — did the task succeed, not which tools were called | Easy | High |
| Feature-flag all prompt changes; roll to 5% → measure → expand | Medium | High |
| Three grader types: deterministic + model-based + human spot-checks | Medium | High |
| Capability evals (hard, low pass rate) + regression evals (easy, ~100%) | Easy | Medium |
| Privacy-safe telemetry by default — type system prevents PII logging | Medium | Medium |
| Read 10 transcripts per week minimum — scores alone hide grader failures | Free | Medium |
| Every bug report becomes a new eval task — your support queue seeds the suite | Easy | Medium |
| Measure pass^k not just pass@k — production reliability compounds across trials | Easy | Medium |
| Killswitches for every major feature — plan for instant rollback | Easy | Medium |
The Meta-Insight
Claude Code doesn’t just run evals. It IS the eval system. Every user session is a production eval:
- 43+ telemetry events per session → code-based grading
- ML classifiers judging safety decisions → model-based grading
- User approve/reject decisions → human grading
- GrowthBook experiments running in parallel → A/B testing
- OTel traces per turn → performance profiling
- Sidechain recordings → session replay and transcript review
Every prompt change is gated behind a feature flag, measured against existing telemetry baselines, and either rolled out or killed based on observed data. The “1.2% token reduction vs qualitative ’be concise’” result quoted in their design documentation is a measured outcome from this exact loop — not an estimate.
The takeaway: don’t build evals as a separate project. Build your agent so that every production session generates graded data. Instrument from day one. Feature-flag from day one. The eval suite is not a phase that comes after the product ships — it’s the same system.