Akshay Parkhi's Weblog


Does Claude Code Test Itself? Yes — Here’s What’s Actually in the Source

31st March 2026

Anthropic published a blog post on demystifying evals for AI agents. It recommends three grader types, eight setup steps, and a feedback loop from production back into improvement decisions. What makes this interesting is what the Claude Code source code reveals: the product doesn’t just follow the philosophy — it IS the eval system.

The Eval Framework

The blog organizes graders into three types:

| Type | Methods | Characteristics |
|------|---------|-----------------|
| Code-based | String match, test pass/fail, outcome verification, tool call verification | Fast, cheap, deterministic |
| Model-based | Rubric scoring, natural language assertions, pairwise comparison, multi-judge consensus | Flexible, scales to complex behaviors |
| Human | SME review, crowdsourcing, spot-checks, A/B testing | Gold standard — but expensive |

Two distinct purposes for eval suites:

| Type | Goal | Target pass rate |
|------|------|------------------|
| Capability evals | What can it do? Hill-climb target. | Start low — room to improve |
| Regression evals | Does it still work? Safety net. | ~100% — any drop is a signal |

Two metrics with a subtle but important difference:

  • pass@k — at least 1 of k trials succeeds. Optimistic. Good for capability measurement.
  • pass^k — ALL k trials succeed. Pessimistic. Correct for production reliability. A 75% per-trial rate across 3 trials gives (0.75)³ ≈ 42% pass^k. That means a user asking the same question three times would see all three succeed less than half the time.
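The two metrics follow directly from a per-trial success rate p. A quick sketch (not from the Claude Code source) makes the gap concrete:

```typescript
// Per-trial success probability p, k independent trials.

// pass@k: at least one of k trials succeeds (optimistic).
function passAtK(p: number, k: number): number {
  return 1 - Math.pow(1 - p, k);
}

// pass^k: all k trials succeed (pessimistic, the reliability view).
function passHatK(p: number, k: number): number {
  return Math.pow(p, k);
}

// The example above: a 75% per-trial rate over 3 trials.
console.log(passAtK(0.75, 3));  // 0.984375 (looks great)
console.log(passHatK(0.75, 3)); // 0.421875 (under half)
```

The same agent scores 98% on one metric and 42% on the other, which is why pass^k is the right lens for production reliability.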

The core grading principle: grade what the agent produced, not the path it took. Check whether tests pass, whether the file is correct, whether the outcome matches the spec. Don’t penalize creative but valid approaches.
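The principle can be made concrete with a small grader sketch. The type and function names here are hypothetical, invented for illustration:

```typescript
// Hypothetical grading: check the artifact, not the trajectory.
type TrialResult = {
  testsPassed: boolean;    // did the project's test suite pass?
  outputFile: string;      // final file content produced by the agent
  toolCallsUsed: string[]; // trajectory: recorded, but NOT graded
};

function gradeOutcome(result: TrialResult, expectedContent: string): boolean {
  // Grade on the outcome: tests green and file matches the spec.
  // result.toolCallsUsed is deliberately ignored; a creative but
  // valid path to the same outcome scores the same.
  return (
    result.testsPassed &&
    result.outputFile.trim() === expectedContent.trim()
  );
}
```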

The 8-Step Eval Roadmap

  1. Start early — 20–50 tasks drawn from real failures
  2. Convert manual tests to automated — remove human bottlenecks
  3. Write unambiguous tasks with reference solutions — ambiguity produces noisy scores
  4. Build balanced problem sets — positive and negative cases, edge cases
  5. Isolated, stable environments — clean state per trial, no cross-contamination
  6. Thoughtful graders — deterministic where possible, model-based where not
  7. Read transcripts — don’t trust scores blindly; graders can be wrong too
  8. Monitor saturation — 100% pass rate means no signal; replace with harder tasks
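Step 8 lends itself to automation. A minimal saturation monitor might look like this (names and shape are illustrative, not from any real harness):

```typescript
// Hypothetical saturation monitor: a task that every run passes
// carries no signal and should be replaced with a harder one.
function saturatedTasks(passRates: Map<string, number>): string[] {
  return [...passRates.entries()]
    .filter(([, rate]) => rate >= 1.0)
    .map(([task]) => task);
}

const rates = new Map<string, number>([
  ["fix-null-deref", 1.0], // saturated: swap for a harder variant
  ["refactor-auth", 0.6],  // still informative
]);
console.log(saturatedTasks(rates)); // [ 'fix-null-deref' ]
```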

What’s Actually in the Source Code

The Claude Code source (visible in the community-analyzed repository) implements a production observability and experimentation infrastructure that maps precisely to these recommendations.

1. Telemetry — 43+ tracked events

Every agent session emits structured telemetry covering four categories:

API RELIABILITY
├─ tengu_api_error       → error type, status code, model
└─ tengu_model_fallback  → original_model → fallback_model

TOOL EXECUTION
├─ tengu_tool_use_success → toolName, duration_ms
├─ tengu_tool_use_error   → error, errorCode, toolName
└─ tengu_tool_use_*       → 8 variants by approval source

PERMISSION FLOW
├─ granted_in_config          → auto-approved by allowlist
├─ granted_by_classifier      → ML-approved
├─ granted_by_hook            → hook-approved
├─ granted_in_prompt_*        → user approved (permanent/temp)
└─ rejected_in_prompt         → user denied

SESSION HEALTH
├─ tengu_init / started / exit / cancel
├─ tengu_flicker              → visual stability regression
├─ tengu_compact_failed       → compaction failures
└─ tengu_uncaught_exception   → unhandled errors

Every event is enriched with: model, platform, version, subscriptionType, userType, sessionId, messageId, requestId, and userBucket (1 of 30 hashed buckets for sampling).
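The enrichment pattern can be sketched as follows. The field names follow the list above; the emit function and the hash are illustrative, not Anthropic's actual code:

```typescript
// Shared session context attached to every telemetry event.
type EventContext = {
  model: string;
  platform: string;
  version: string;
  subscriptionType: string;
  userType: string;
  sessionId: string;
  userBucket: number; // 1..30
};

// Stable hash of the user id into 1 of 30 buckets for sampling.
// (Illustrative hash, not the one used in the source.)
function userBucket(userId: string): number {
  let h = 0;
  for (const ch of userId) h = (h * 31 + ch.charCodeAt(0)) >>> 0;
  return (h % 30) + 1;
}

// Every event carries its own payload plus the shared context.
function emitEvent(
  name: string,
  payload: Record<string, unknown>,
  ctx: EventContext,
): Record<string, unknown> {
  return { event: name, ...payload, ...ctx };
}
```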

2. A/B Testing — GrowthBook experiment infrastructure

The codebase contains a full experiment platform with user targeting attributes:

User attributes for targeting:
├─ id, sessionId, deviceID
├─ platform (win32 / darwin / linux)
├─ organizationUUID, accountUUID
├─ userType (ant vs external)
├─ subscriptionType (free / paid)
├─ rateLimitTier
├─ appVersion
└─ email, github metadata

When a user is assigned to an experiment, the exposure event captures the experimentId, the variantId, and the full user attributes at assignment time. Events flow to /api/event_logging/batch and then to BigQuery.

Three feature flag read patterns are used:

  • CACHED_MAY_BE_STALE — non-blocking, safe to use at startup
  • CACHED_OR_BLOCKING — for user-invoked features where freshness matters
  • Env var overrides via CLAUDE_INTERNAL_FC_OVERRIDES — for eval harness use
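The three read patterns can be sketched as a small flag store. The class and its API shape are assumptions; so is the override format (a JSON object in the env var), though the CLAUDE_INTERNAL_FC_OVERRIDES name is from the source:

```typescript
// Hypothetical flag store illustrating the three read patterns.
class FlagStore {
  private cache = new Map<string, boolean>();

  constructor(private fetchRemote: (flag: string) => Promise<boolean>) {}

  // CACHED_MAY_BE_STALE: never blocks, safe on the startup path.
  readCached(flag: string, fallback = false): boolean {
    return this.cache.get(flag) ?? fallback;
  }

  // CACHED_OR_BLOCKING: await a fresh value when freshness matters.
  async readFresh(flag: string): Promise<boolean> {
    const value = await this.fetchRemote(flag);
    this.cache.set(flag, value);
    return value;
  }

  // Env override: lets an eval harness pin flags deterministically.
  readWithOverride(flag: string, envValue?: string): boolean {
    if (envValue) {
      const parsed = JSON.parse(envValue) as Record<string, boolean>;
      if (flag in parsed) return parsed[flag];
    }
    return this.readCached(flag);
  }
}
```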

GrowthBook refreshes every 6 hours for external users, every 20 minutes for internal Anthropic employees — who get new experiments first.

3. OpenTelemetry Tracing — Full request lifecycle

Each agent turn generates a structured trace:

Turn Span (full turn duration)
├─ LLM Request Span
│    attrs: model, message_count, token counts
│
├─ Tool Execution Span
│    attrs: tool_name, duration_ms
│    │
│    ├─ User Blocking Span (if permission needed)
│    │    attrs: wait_duration_ms
│    │
│    └─ Tool Operation Span
│         attrs: result_size, error (if any)
│
└─ Hook Span (if hooks ran)

Traces export via OTLP (gRPC or HTTP) to the Anthropic backend, plus Perfetto traces for local Chrome DevTools debugging. Orphaned spans have a 30-minute TTL.
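The nesting above can be modeled with a hand-rolled span tree. This is the structure only, not the real OpenTelemetry API, and the attribute values are invented:

```typescript
// Minimal span sketch mirroring the turn trace above.
type Span = {
  name: string;
  attrs: Record<string, unknown>;
  children: Span[];
};

function span(
  name: string,
  attrs: Record<string, unknown> = {},
  children: Span[] = [],
): Span {
  return { name, attrs, children };
}

const turn = span("turn", {}, [
  span("llm_request", { model: "claude-x", message_count: 12 }),
  span("tool_execution", { tool_name: "Bash", duration_ms: 180 }, [
    span("user_blocking", { wait_duration_ms: 2400 }), // permission prompt
    span("tool_operation", { result_size: 4096 }),
  ]),
]);
```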

4. Privacy-Safe Telemetry by Design

Analytics fields must pass through a marker type:

type AnalyticsMetadata = {
  metadata_I_VERIFIED_THIS_IS_NOT_CODE_OR_FILEPATHS: string
}

A developer must attest in the type signature that a field doesn’t contain PII, code, or file paths. The compiler enforces this — you cannot accidentally log sensitive data. Additional safeguards: MCP tool names are sanitized, user IDs are hashed into 30 buckets, tool inputs are truncated to 512 characters with a 4KB JSON cap, and proto fields are stripped before Datadog dispatch.
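A sketch of how such a marker type blocks accidental logging. The AnalyticsMetadata type is from the source; the logMetric sink is hypothetical:

```typescript
type AnalyticsMetadata = {
  metadata_I_VERIFIED_THIS_IS_NOT_CODE_OR_FILEPATHS: string;
};

// Hypothetical sink: accepts only fields wrapped in the marker type.
function logMetric(name: string, meta: AnalyticsMetadata): string {
  return `${name}:${meta.metadata_I_VERIFIED_THIS_IS_NOT_CODE_OR_FILEPATHS}`;
}

// Compiles: the developer attested to the field's contents.
logMetric("tool_result", {
  metadata_I_VERIFIED_THIS_IS_NOT_CODE_OR_FILEPATHS: "bash_exit_0",
});

// Does NOT compile: a bare string cannot reach the sink.
// logMetric("tool_result", "/Users/alice/secret.ts");
```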

5. 40+ Feature Flags

A representative sample of what’s gated:

| Flag | What it gates |
|------|---------------|
| tengu_concise_v2 | Output concision prompt changes |
| tengu_auto_mode_* | Classifier-based permission approval |
| tengu_amber_flint | Agent swarms / team mode |
| tengu_penguins_off | Fast mode killswitch |
| tengu_tool_pear | Strict tool use format |
| tengu_bramble_lintel | Memory extraction frequency |
| tengu_frond_boric | Analytics sink killswitches |
| TRANSCRIPT_CLASSIFIER | ML-based permission classification |
| BASH_CLASSIFIER | Bash command safety classification |
| CONTEXT_COLLAPSE | Context collapse feature |
| COORDINATOR_MODE | Multi-agent orchestration |
| DAEMON | Background daemon mode |
| AGENT_TRIGGERS | Scheduled agent triggers |

How the Blog Maps to the Code

| Blog recommendation | What Claude Code actually does |
|---------------------|--------------------------------|
| Start with manual tests from real failures | Started with Anthropic employee dogfooding, then formalized |
| Code-based graders: outcome verification | 43+ telemetry events — tool success/fail, token counts, cache hits |
| Model-based graders: rubric scoring | TRANSCRIPT_CLASSIFIER and BASH_CLASSIFIER for safety decisions |
| Human graders: gold standard | User approve/reject decisions with feedback flag; real A/B testing sessions |
| A/B testing with traffic | GrowthBook with 30-bucket user hashing and BigQuery pipeline |
| Production monitoring | Datadog (43 event types) + OpenTelemetry + Perfetto |
| Capability vs regression split | Feature flags gate new behaviors (capability); telemetry catches regressions in existing metrics |
| Grade outcomes, not paths | Tracks tool_use_success/error — not “did it use the right tool sequence” |
| Read transcripts | Sidechain transcripts per agent, session recording, resume system |
| Isolated environments | Git worktree isolation for agents, sandbox for bash, clean state per trial |

The Full Testing Feedback Loop

Putting it all together, here is the cycle that runs continuously:

1. INFRASTRUCTURE
   ├─ Internal "ant" users get experiments first (20min refresh)
   ├─ Env var overrides for eval harnesses
   └─ /config Gates tab for developer debugging

2. HYPOTHESIS
   ├─ Create GrowthBook experiment
   ├─ Gate prompt section with feature flag
   └─ Roll out to 5% of internal users

3. MEASUREMENT (automated, continuous)
   ├─ Telemetry events → Datadog dashboards
   ├─ OTel traces → per-turn breakdown
   └─ Control vs variant comparison:
        - Output tokens per turn
        - Tool success rate
        - User cancellation rate
        - Cache hit rate
        - Session duration

4. DECISION
   ├─ Wins?      → Roll to 100% external users
   ├─ Regresses? → Kill experiment
   ├─ Unclear?   → Expand to 20%, gather more data
   └─ Incident?  → Killswitch fires immediately

5. REGRESSION GUARD
   ├─ Existing telemetry becomes regression baseline
   ├─ Cache break detection (12 checks)
   ├─ tengu_flicker detects visual stability regressions
   └─ Model fallback tracking catches API reliability drops
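The decision step (4) reduces to a small policy function. The thresholds and names below are invented for illustration, not Anthropic's actual policy:

```typescript
type ExperimentReading = {
  lift: number;         // variant minus control on the primary metric
  significant: boolean; // did the comparison reach significance?
  incident: boolean;    // did a killswitch-worthy incident fire?
};

type Decision = "rollout_100" | "kill" | "expand_20" | "killswitch";

function decide(r: ExperimentReading): Decision {
  if (r.incident) return "killswitch";    // fires immediately
  if (!r.significant) return "expand_20"; // unclear: gather more data
  return r.lift > 0 ? "rollout_100" : "kill";
}
```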

What to Steal for Your Agent System

| Pattern | Effort | Impact |
|---------|--------|--------|
| Instrument from day 1: tool success/fail, tokens, latency, user interrupts | Easy | High |
| Grade outcomes not paths — did the task succeed, not which tools were called | Easy | High |
| Feature-flag all prompt changes; roll to 5% → measure → expand | Medium | High |
| Three grader types: deterministic + model-based + human spot-checks | Medium | High |
| Capability evals (hard, low pass rate) + regression evals (easy, ~100%) | Easy | Medium |
| Privacy-safe telemetry by default — type system prevents PII logging | Medium | Medium |
| Read 10 transcripts per week minimum — scores alone hide grader failures | Free | Medium |
| Every bug report becomes a new eval task — your support queue seeds the suite | Easy | Medium |
| Measure pass^k not just pass@k — production reliability compounds across trials | Easy | Medium |
| Killswitches for every major feature — plan for instant rollback | Easy | Medium |

The Meta-Insight

Claude Code doesn’t just run evals. It IS the eval system. Every user session is a production eval:

  • 43+ telemetry events per session → code-based grading
  • ML classifiers judging safety decisions → model-based grading
  • User approve/reject decisions → human grading
  • GrowthBook experiments running in parallel → A/B testing
  • OTel traces per turn → performance profiling
  • Sidechain recordings → session replay and transcript review

Every prompt change is gated behind a feature flag, measured against existing telemetry baselines, and either rolled out or killed based on observed data. The “1.2% token reduction vs qualitative ’be concise’” result quoted in their design documentation is a measured outcome from this exact loop — not an estimate.

The takeaway: don’t build evals as a separate project. Build your agent so that every production session generates graded data. Instrument from day one. Feature-flag from day one. The eval suite is not a phase that comes after the product ships — it’s the same system.

