Akshay Parkhi's Weblog


Does Claude Code Test Itself? Yes — Here’s What’s Actually in the Source

31st March 2026

Anthropic published a blog post on demystifying evals for AI agents. It recommends three grader types, eight setup steps, and a feedback loop from production back into improvement decisions. What makes this interesting is what the Claude Code source code reveals: the product doesn’t just follow the philosophy — it IS the eval system.

The Eval Framework

The blog organizes graders into three types:

| Type | Methods | Characteristics |
|------|---------|-----------------|
| Code-based | String match, test pass/fail, outcome verification, tool call verification | Fast, cheap, deterministic |
| Model-based | Rubric scoring, natural language assertions, pairwise comparison, multi-judge consensus | Flexible, scales to complex behaviors |
| Human | SME review, crowdsourcing, spot-checks, A/B testing | Gold standard — but expensive |

Two distinct purposes for eval suites:

| Type | Goal | Target pass rate |
|------|------|------------------|
| Capability evals | What can it do? Hill-climb target. | Start low — room to improve |
| Regression evals | Does it still work? Safety net. | ~100% — any drop is a signal |

Two metrics with a subtle but important difference:

  • pass@k — at least 1 of k trials succeeds. Optimistic. Good for capability measurement.
  • pass^k — ALL k trials succeed. Pessimistic. Correct for production reliability. A 75% per-trial rate across 3 trials gives (0.75)³ ≈ 42% pass^k. That means a user asking the same question three times would see all three succeed less than half the time.
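The two metrics follow directly from a per-trial success rate p. A quick sketch (not from the Claude Code source) makes the gap concrete:

```typescript
// Per-trial success probability p, k independent trials.

// pass@k: at least one of k trials succeeds (optimistic).
function passAtK(p: number, k: number): number {
  return 1 - Math.pow(1 - p, k);
}

// pass^k: all k trials succeed (pessimistic, the reliability view).
function passHatK(p: number, k: number): number {
  return Math.pow(p, k);
}

// The example above: a 75% per-trial rate over 3 trials.
console.log(passAtK(0.75, 3));  // 0.984375 (looks great)
console.log(passHatK(0.75, 3)); // 0.421875 (under half)
```

The same agent scores 98% on one metric and 42% on the other, which is why pass^k is the right lens for production reliability.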

The core grading principle: grade what the agent produced, not the path it took. Check whether tests pass, whether the file is correct, whether the outcome matches the spec. Don’t penalize creative but valid approaches.
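The principle can be made concrete with a small grader sketch. The type and function names here are hypothetical, invented for illustration:

```typescript
// Hypothetical grading: check the artifact, not the trajectory.
type TrialResult = {
  testsPassed: boolean;    // did the project's test suite pass?
  outputFile: string;      // final file content produced by the agent
  toolCallsUsed: string[]; // trajectory: recorded, but NOT graded
};

function gradeOutcome(result: TrialResult, expectedContent: string): boolean {
  // Grade on the outcome: tests green and file matches the spec.
  // result.toolCallsUsed is deliberately ignored; a creative but
  // valid path to the same outcome scores the same.
  return (
    result.testsPassed &&
    result.outputFile.trim() === expectedContent.trim()
  );
}
```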

The 8-Step Eval Roadmap

  1. Start early — 20–50 tasks drawn from real failures
  2. Convert manual tests to automated — remove human bottlenecks
  3. Write unambiguous tasks with reference solutions — ambiguity produces noisy scores
  4. Build balanced problem sets — positive and negative cases, edge cases
  5. Isolated, stable environments — clean state per trial, no cross-contamination
  6. Thoughtful graders — deterministic where possible, model-based where not
  7. Read transcripts — don’t trust scores blindly; graders can be wrong too
  8. Monitor saturation — 100% pass rate means no signal; replace with harder tasks
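Step 8 lends itself to automation. A minimal saturation monitor might look like this (names and shape are illustrative, not from any real harness):

```typescript
// Hypothetical saturation monitor: a task that every run passes
// carries no signal and should be replaced with a harder one.
function saturatedTasks(passRates: Map<string, number>): string[] {
  return [...passRates.entries()]
    .filter(([, rate]) => rate >= 1.0)
    .map(([task]) => task);
}

const rates = new Map<string, number>([
  ["fix-null-deref", 1.0], // saturated: swap for a harder variant
  ["refactor-auth", 0.6],  // still informative
]);
console.log(saturatedTasks(rates)); // [ 'fix-null-deref' ]
```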

What’s Actually in the Source Code

The Claude Code source (visible in the community-analyzed repository) implements a production observability and experimentation infrastructure that maps precisely to these recommendations.

1. Telemetry — 43+ tracked events

Every agent session emits structured telemetry covering four categories:

API RELIABILITY
├─ tengu_api_error       → error type, status code, model
└─ tengu_model_fallback  → original_model → fallback_model

TOOL EXECUTION
├─ tengu_tool_use_success → toolName, duration_ms
├─ tengu_tool_use_error   → error, errorCode, toolName
└─ tengu_tool_use_*       → 8 variants by approval source

PERMISSION FLOW
├─ granted_in_config          → auto-approved by allowlist
├─ granted_by_classifier      → ML-approved
├─ granted_by_hook            → hook-approved
├─ granted_in_prompt_*        → user approved (permanent/temp)
└─ rejected_in_prompt         → user denied

SESSION HEALTH
├─ tengu_init / started / exit / cancel
├─ tengu_flicker              → visual stability regression
├─ tengu_compact_failed       → compaction failures
└─ tengu_uncaught_exception   → unhandled errors

Every event is enriched with: model, platform, version, subscriptionType, userType, sessionId, messageId, requestId, and userBucket (1 of 30 hashed buckets for sampling).
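The enrichment pattern can be sketched as follows. The field names follow the list above; the emit function and the hash are illustrative, not Anthropic's actual code:

```typescript
// Shared session context attached to every telemetry event.
type EventContext = {
  model: string;
  platform: string;
  version: string;
  subscriptionType: string;
  userType: string;
  sessionId: string;
  userBucket: number; // 1..30
};

// Stable hash of the user id into 1 of 30 buckets for sampling.
// (Illustrative hash, not the one used in the source.)
function userBucket(userId: string): number {
  let h = 0;
  for (const ch of userId) h = (h * 31 + ch.charCodeAt(0)) >>> 0;
  return (h % 30) + 1;
}

// Every event carries its own payload plus the shared context.
function emitEvent(
  name: string,
  payload: Record<string, unknown>,
  ctx: EventContext,
): Record<string, unknown> {
  return { event: name, ...payload, ...ctx };
}
```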

2. A/B Testing — GrowthBook experiment infrastructure

The codebase contains a full experiment platform with user targeting attributes:

User attributes for targeting:
├─ id, sessionId, deviceID
├─ platform (win32 / darwin / linux)
├─ organizationUUID, accountUUID
├─ userType (ant vs external)
├─ subscriptionType (free / paid)
├─ rateLimitTier
├─ appVersion
└─ email, github metadata

When a user is assigned to an experiment, the exposure event captures the experimentId, the variantId, and the full user attributes at assignment time. Events flow to /api/event_logging/batch and then to BigQuery.

Three feature flag read patterns are used:

  • CACHED_MAY_BE_STALE — non-blocking, safe to use at startup
  • CACHED_OR_BLOCKING — for user-invoked features where freshness matters
  • Env var overrides via CLAUDE_INTERNAL_FC_OVERRIDES — for eval harness use
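The three read patterns can be sketched as a small flag store. The class and its API shape are assumptions; so is the override format (a JSON object in the env var), though the CLAUDE_INTERNAL_FC_OVERRIDES name is from the source:

```typescript
// Hypothetical flag store illustrating the three read patterns.
class FlagStore {
  private cache = new Map<string, boolean>();

  constructor(private fetchRemote: (flag: string) => Promise<boolean>) {}

  // CACHED_MAY_BE_STALE: never blocks, safe on the startup path.
  readCached(flag: string, fallback = false): boolean {
    return this.cache.get(flag) ?? fallback;
  }

  // CACHED_OR_BLOCKING: await a fresh value when freshness matters.
  async readFresh(flag: string): Promise<boolean> {
    const value = await this.fetchRemote(flag);
    this.cache.set(flag, value);
    return value;
  }

  // Env override: lets an eval harness pin flags deterministically.
  readWithOverride(flag: string, envValue?: string): boolean {
    if (envValue) {
      const parsed = JSON.parse(envValue) as Record<string, boolean>;
      if (flag in parsed) return parsed[flag];
    }
    return this.readCached(flag);
  }
}
```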

GrowthBook refreshes every 6 hours for external users, every 20 minutes for internal Anthropic employees — who get new experiments first.

3. OpenTelemetry Tracing — Full request lifecycle

Each agent turn generates a structured trace:

Turn Span (full turn duration)
├─ LLM Request Span
│    attrs: model, message_count, token counts
│
├─ Tool Execution Span
│    attrs: tool_name, duration_ms
│    │
│    ├─ User Blocking Span (if permission needed)
│    │    attrs: wait_duration_ms
│    │
│    └─ Tool Operation Span
│         attrs: result_size, error (if any)
│
└─ Hook Span (if hooks ran)

Traces export via OTLP (gRPC or HTTP) to the Anthropic backend, plus Perfetto traces for local Chrome DevTools debugging. Orphaned spans have a 30-minute TTL.
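The nesting above can be modeled with a hand-rolled span tree. This is the structure only, not the real OpenTelemetry API, and the attribute values are invented:

```typescript
// Minimal span sketch mirroring the turn trace above.
type Span = {
  name: string;
  attrs: Record<string, unknown>;
  children: Span[];
};

function span(
  name: string,
  attrs: Record<string, unknown> = {},
  children: Span[] = [],
): Span {
  return { name, attrs, children };
}

const turn = span("turn", {}, [
  span("llm_request", { model: "claude-x", message_count: 12 }),
  span("tool_execution", { tool_name: "Bash", duration_ms: 180 }, [
    span("user_blocking", { wait_duration_ms: 2400 }), // permission prompt
    span("tool_operation", { result_size: 4096 }),
  ]),
]);
```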

4. Privacy-Safe Telemetry by Design

Analytics fields must pass through a marker type:

type AnalyticsMetadata = {
  metadata_I_VERIFIED_THIS_IS_NOT_CODE_OR_FILEPATHS: string
}

A developer must attest in the type signature that a field doesn’t contain PII, code, or file paths. The compiler enforces this — you cannot accidentally log sensitive data. Additional safeguards: MCP tool names are sanitized, user IDs are hashed into 30 buckets, tool inputs are truncated to 512 characters with a 4KB JSON cap, and proto fields are stripped before Datadog dispatch.
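A sketch of how such a marker type blocks accidental logging. The AnalyticsMetadata type is from the source; the logMetric sink is hypothetical:

```typescript
type AnalyticsMetadata = {
  metadata_I_VERIFIED_THIS_IS_NOT_CODE_OR_FILEPATHS: string;
};

// Hypothetical sink: accepts only fields wrapped in the marker type.
function logMetric(name: string, meta: AnalyticsMetadata): string {
  return `${name}:${meta.metadata_I_VERIFIED_THIS_IS_NOT_CODE_OR_FILEPATHS}`;
}

// Compiles: the developer attested to the field's contents.
logMetric("tool_result", {
  metadata_I_VERIFIED_THIS_IS_NOT_CODE_OR_FILEPATHS: "bash_exit_0",
});

// Does NOT compile: a bare string cannot reach the sink.
// logMetric("tool_result", "/Users/alice/secret.ts");
```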

5. 40+ Feature Flags

A representative sample of what’s gated:

| Flag | What it gates |
|------|---------------|
| tengu_concise_v2 | Output concision prompt changes |
| tengu_auto_mode_* | Classifier-based permission approval |
| tengu_amber_flint | Agent swarms / team mode |
| tengu_penguins_off | Fast mode killswitch |
| tengu_tool_pear | Strict tool use format |
| tengu_bramble_lintel | Memory extraction frequency |
| tengu_frond_boric | Analytics sink killswitches |
| TRANSCRIPT_CLASSIFIER | ML-based permission classification |
| BASH_CLASSIFIER | Bash command safety classification |
| CONTEXT_COLLAPSE | Context collapse feature |
| COORDINATOR_MODE | Multi-agent orchestration |
| DAEMON | Background daemon mode |
| AGENT_TRIGGERS | Scheduled agent triggers |

How the Blog Maps to the Code

| Blog recommendation | What Claude Code actually does |
|---------------------|--------------------------------|
| Start with manual tests from real failures | Started with Anthropic employee dogfooding, then formalized |
| Code-based graders: outcome verification | 43+ telemetry events — tool success/fail, token counts, cache hits |
| Model-based graders: rubric scoring | TRANSCRIPT_CLASSIFIER and BASH_CLASSIFIER for safety decisions |
| Human graders: gold standard | User approve/reject decisions with feedback flag; real A/B testing sessions |
| A/B testing with traffic | GrowthBook with 30-bucket user hashing and BigQuery pipeline |
| Production monitoring | Datadog (43 event types) + OpenTelemetry + Perfetto |
| Capability vs regression split | Feature flags gate new behaviors (capability); telemetry catches regressions in existing metrics |
| Grade outcomes, not paths | Tracks tool_use_success/error — not “did it use the right tool sequence” |
| Read transcripts | Sidechain transcripts per agent, session recording, resume system |
| Isolated environments | Git worktree isolation for agents, sandbox for bash, clean state per trial |

The Full Testing Feedback Loop

Putting it all together, here is the cycle that runs continuously:

1. INFRASTRUCTURE
   ├─ Internal "ant" users get experiments first (20min refresh)
   ├─ Env var overrides for eval harnesses
   └─ /config Gates tab for developer debugging

2. HYPOTHESIS
   ├─ Create GrowthBook experiment
   ├─ Gate prompt section with feature flag
   └─ Roll out to 5% of internal users

3. MEASUREMENT (automated, continuous)
   ├─ Telemetry events → Datadog dashboards
   ├─ OTel traces → per-turn breakdown
   └─ Control vs variant comparison:
        - Output tokens per turn
        - Tool success rate
        - User cancellation rate
        - Cache hit rate
        - Session duration

4. DECISION
   ├─ Wins?      → Roll to 100% external users
   ├─ Regresses? → Kill experiment
   ├─ Unclear?   → Expand to 20%, gather more data
   └─ Incident?  → Killswitch fires immediately

5. REGRESSION GUARD
   ├─ Existing telemetry becomes regression baseline
   ├─ Cache break detection (12 checks)
   ├─ tengu_flicker detects visual stability regressions
   └─ Model fallback tracking catches API reliability drops
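The decision step (4) reduces to a small policy function. The thresholds and names below are invented for illustration, not Anthropic's actual policy:

```typescript
type ExperimentReading = {
  lift: number;         // variant minus control on the primary metric
  significant: boolean; // did the comparison reach significance?
  incident: boolean;    // did a killswitch-worthy incident fire?
};

type Decision = "rollout_100" | "kill" | "expand_20" | "killswitch";

function decide(r: ExperimentReading): Decision {
  if (r.incident) return "killswitch";    // fires immediately
  if (!r.significant) return "expand_20"; // unclear: gather more data
  return r.lift > 0 ? "rollout_100" : "kill";
}
```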

What to Steal for Your Agent System

| Pattern | Effort | Impact |
|---------|--------|--------|
| Instrument from day 1: tool success/fail, tokens, latency, user interrupts | Easy | High |
| Grade outcomes not paths — did the task succeed, not which tools were called | Easy | High |
| Feature-flag all prompt changes; roll to 5% → measure → expand | Medium | High |
| Three grader types: deterministic + model-based + human spot-checks | Medium | High |
| Capability evals (hard, low pass rate) + regression evals (easy, ~100%) | Easy | Medium |
| Privacy-safe telemetry by default — type system prevents PII logging | Medium | Medium |
| Read 10 transcripts per week minimum — scores alone hide grader failures | Free | Medium |
| Every bug report becomes a new eval task — your support queue seeds the suite | Easy | Medium |
| Measure pass^k not just pass@k — production reliability compounds across trials | Easy | Medium |
| Killswitches for every major feature — plan for instant rollback | Easy | Medium |

The Meta-Insight

Claude Code doesn’t just run evals. It IS the eval system. Every user session is a production eval:

  • 43+ telemetry events per session → code-based grading
  • ML classifiers judging safety decisions → model-based grading
  • User approve/reject decisions → human grading
  • GrowthBook experiments running in parallel → A/B testing
  • OTel traces per turn → performance profiling
  • Sidechain recordings → session replay and transcript review

Every prompt change is gated behind a feature flag, measured against existing telemetry baselines, and either rolled out or killed based on observed data. The “1.2% token reduction vs qualitative ’be concise’” result quoted in their design documentation is a measured outcome from this exact loop — not an estimate.

The takeaway: don’t build evals as a separate project. Build your agent so that every production session generates graded data. Instrument from day one. Feature-flag from day one. The eval suite is not a phase that comes after the product ships — it’s the same system.

