From Prompt Engineering to Harness Engineering: Building Infrastructure for Autonomous Agents
18th March 2026
2025 was the year of agents. 2026 is the year of harnesses — the persistent infrastructure that gives a foundation model hands, feet, and senses. The shift is fundamental: from prompt engineering (optimizing single interactions) to harness engineering (building the systems that control long-running, autonomous agents).
What Is a Harness?
A harness is the software layer wrapping a foundation model. It manages tool access, keeps track of progress, and recovers when the model fails. Standard chat models are “question to answer.” Agents are “goal to result.” The harness is what makes that difference possible.
```
+-------------------------------------------------------+
|                   THE HARNESS LAYER                   |
|                                                       |
|  +-------------+   +-------------+   +-----------+    |
|  |   Context   |   |    Tool     |   |  Memory   |    |
|  | Management  |   |   Access    |   |  System   |    |
|  +------+------+   +------+------+   +-----+-----+    |
|         |                 |                |          |
|  +------v-----------------v----------------v---+      |
|  |             ORCHESTRATION LOOP              |      |
|  |  reason -> act -> observe -> reason -> ...  |      |
|  +----------------------+----------------------+      |
|                         |                             |
|  +----------------------v----------------------+      |
|  |           FOUNDATION MODEL (LLM)            |      |
|  +---------------------------------------------+      |
+-------------------------------------------------------+
```
Intelligence increasingly resides in the scaffolding — the reasoning, memory systems, and tool optimization — rather than the raw power of the LLM.
Context Management: The Hardest Problem
Managing the context window is the most difficult engineering challenge in creating reliable agents. Even models with million-token windows face performance degradation as the window fills up. Performance begins to rot once a window is roughly 40% full, leading to lost signal and poor instruction following.
The Playbook: Reduce, Offload, Isolate
| Strategy | How It Works |
|---|---|
| REDUCE | Prune old tool results, summarize conversation trajectories, keep context lean |
| OFFLOAD | Use the file system or a database as external long-term memory instead of cramming everything into the prompt |
| ISOLATE | Use sub-agents for token-heavy tasks (research, debugging) to keep the orchestrator context clean |
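The “reduce” strategy can be sketched in a few lines. This is an illustrative example, not any particular framework’s API: the message shape and the characters-per-token heuristic are assumptions.

```python
# Sketch of the "reduce" strategy: replace the oldest tool results with
# short stubs once the transcript exceeds a token budget. The message
# format and the ~4-chars-per-token estimate are illustrative assumptions.

def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token.
    return len(text) // 4

def reduce_context(messages: list[dict], budget: int = 50_000) -> list[dict]:
    """Stub out the oldest tool results until the transcript fits the budget."""
    total = sum(estimate_tokens(m["content"]) for m in messages)
    pruned = []
    for msg in messages:
        if total > budget and msg["role"] == "tool":
            stub = {"role": "tool",
                    "content": "[tool result pruned; re-run if needed]"}
            total -= estimate_tokens(msg["content"]) - estimate_tokens(stub["content"])
            pruned.append(stub)
        else:
            pruned.append(msg)
    return pruned
```

The key design choice is that old tool output is the cheapest thing to drop: the agent can always re-run the tool, whereas user instructions cannot be recovered.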
This is why every serious coding agent — Claude Code, OpenCode, Pi — uses sub-agents. It’s not just about parallelism. It’s about protecting the main context window.
The Initializer-Coder Pattern
The industry standard for multi-hour or multi-day tasks. Never ask an agent to build an entire complex application in one shot — that leads to implementation failures and context amnesia.
```
PHASE 1: THE INITIALIZER (runs once)
  |
  |-- Reads the specification
  |-- Creates machine-readable feature list (JSON)
  |-- Every task marked "failed" by default
  |-- Sets up environment (init.sh)
  |
  v
PHASE 2: THE TASK AGENT (iterates)
  |
  |-- Picks one feature at a time
  |-- Implements it
  |-- Verifies it (tests pass?)
  |-- Commits progress
  |-- Updates feature status to "passed"
  |-- Picks next feature
  |-- Repeats until done
```
The Four Artifacts
Continuity across discrete sessions is maintained through four core artifacts:
- features.json — machine-readable task list with pass/fail status
- init.sh — environment initialization script
- progress.md — narrative progress log
- Git history — descriptive commits as a narrative timeline
Bash Is All You Need
A major insight shared by Vercel, Anthropic, and independent builders: models perform better with generic, code-native tools than with bespoke, complex tool schemas.
Instead of building 100 specialized tools, give the agent access to a Bash tool and a file system. The model writes its own scripts to solve problems, expanding its action space dramatically without bloating the system prompt.
| Approach | Tools | Accuracy | Speed |
|---|---|---|---|
| Specialized tools (100+) | Custom schema per task | 80% | Baseline |
| Bash + filesystem | 2 generic tools | 100% | 3.5x faster |
Vercel saw this exact result with a text-to-SQL agent: removing 80% of specialized tools and replacing them with a Bash terminal jumped accuracy from 80% to 100% while running 3.5x faster.
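A “two generic tools” setup can be sketched as follows. The tool-call shape and tool names here are assumptions; wire them to whatever function-calling format your model provider uses:

```python
import subprocess
from pathlib import Path

# Sketch of the two-generic-tools approach: a bash executor and a file
# writer, dispatched by name. The {"name": ..., "args": ...} tool-call
# shape is an illustrative assumption, not a specific provider's API.

def run_bash(command: str, timeout: int = 60) -> str:
    """Run a shell command and return combined stdout/stderr."""
    result = subprocess.run(
        ["bash", "-c", command],
        capture_output=True, text=True, timeout=timeout,
    )
    return result.stdout + result.stderr

def write_file(path: str, content: str) -> str:
    """Write a file and return a short confirmation for the transcript."""
    Path(path).write_text(content)
    return f"wrote {len(content)} bytes to {path}"

TOOLS = {"bash": run_bash, "write_file": write_file}

def dispatch(tool_call: dict) -> str:
    return TOOLS[tool_call["name"]](**tool_call["args"])
```

With only these two entries in the system prompt, the model can still grep, curl, run tests, and compose its own scripts: the action space grows without the schema growing.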
Skills as SOPs for AI
Skills are folders containing scripts and instructions that an agent picks up only when needed. They reduce cognitive load and prevent context pollution — the agent doesn’t carry knowledge about deploying to AWS until it actually needs to deploy.
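Lazy loading is the whole trick: only a one-line description per skill sits in the prompt, and the full instructions enter context on demand. A minimal sketch, assuming a hypothetical `skills/<name>/SKILL.md` layout whose first line is the description:

```python
from pathlib import Path

# Sketch of lazy skill loading. The skills/<name>/SKILL.md layout and
# first-line-as-description convention are assumptions for illustration.

SKILLS_DIR = Path("skills")

def skill_index() -> dict[str, str]:
    """Map skill name -> one-line description (cheap enough for the prompt)."""
    index = {}
    for skill_md in SKILLS_DIR.glob("*/SKILL.md"):
        index[skill_md.parent.name] = skill_md.read_text().splitlines()[0]
    return index

def load_skill(name: str) -> str:
    """Pull the full instructions into context only when the agent needs them."""
    return (SKILLS_DIR / name / "SKILL.md").read_text()
```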
Verification and Reliability
Reliability in agentic systems drops exponentially with steps. A 95% success rate on single steps becomes only 36% over a 20-step task.
```
Step success rate: 95%

 1 step:  95.0%
 5 steps: 77.4%
10 steps: 59.9%
20 steps: 35.8%  <-- this is where most real tasks live
50 steps:  7.7%
```
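These numbers are nothing more than the per-step rate raised to the number of steps:

```python
# End-to-end success over n independent steps is just p ** n.
def end_to_end_success(step_rate: float, steps: int) -> float:
    return step_rate ** steps

for n in (1, 5, 10, 20, 50):
    print(f"{n:2d} steps: {end_to_end_success(0.95, n):.1%}")
```

The assumption of independence is generous: in practice one bad step often poisons the context for every step after it, so real trajectories can decay faster than this.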
The fix is deterministic feedback built into the harness:
- Automated tests — unit tests, linting, type checking after every change
- Eyes — Puppeteer or Chrome DevTools to verify UI changes the model can’t see in code alone
- Human-in-the-loop — strategic checkpoints for high-risk operations (ad budgets, production merges)
- Self-correction — let models read their own error logs and iterate until tests pass
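The self-correction item above reduces to a simple loop around a deterministic oracle. A sketch, where `ask_model_to_fix` is a hypothetical stand-in for your model call and the test command is configurable:

```python
import subprocess

# Sketch of a self-correction loop: run the test suite, and on failure
# hand the error output back to the model for another attempt.
# `ask_model_to_fix` is a hypothetical callback, not a real API.

def verify_and_iterate(ask_model_to_fix,
                       test_cmd=("pytest", "-q"),
                       max_attempts: int = 3) -> bool:
    for _ in range(max_attempts):
        result = subprocess.run(list(test_cmd), capture_output=True, text=True)
        if result.returncode == 0:
            return True  # deterministic signal: tests pass
        ask_model_to_fix(result.stdout + result.stderr)
    return False  # escalate to a human instead of looping forever
```

The exit code, not the model's own opinion of its work, decides whether the loop ends; that is what makes the feedback deterministic.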
Agentic DevOps
A new discipline is emerging that applies DevOps principles to autonomous agents:
| Principle | Applied to Agents |
|---|---|
| Guardrails | Permission scoping, restricted tools |
| Golden paths | CLAUDE.md, agents.md, coding standards |
| Safety nets | Git commits, rollback, test suites |
| Manual review | HITL checkpoints at critical steps |
The Builder’s Checklist
- Start simple. Don’t jump to agents if a structured workflow or a single prompt will suffice.
- Onboard your agent. Treat it like a new employee. Create an agents.md or CLAUDE.md file — the source of truth for roles, business context, and coding standards.
- Implement a memory loop. Tell the agent to update a memory.md file whenever it learns a new preference or corrects a mistake.
- Embrace the bitter lesson. As models improve, remove the crutches. Simpler systems that scale with compute eventually win.
- Use Git for state. Always require the agent to commit with descriptive messages. The Git log is a narrative history future agents can read.
- Leverage MCP. Use the Model Context Protocol to connect your agent to external data sources (Google Drive, Slack, GitHub) in a standardized way.
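The memory loop from the checklist is small enough to show in full. A sketch, assuming a single append-only `memory.md` file:

```python
from datetime import date
from pathlib import Path

# Sketch of the memory loop: append each learned preference or
# correction to memory.md so future sessions can read it back.

MEMORY = Path("memory.md")

def remember(lesson: str) -> None:
    """Append one dated lesson as a markdown bullet."""
    with MEMORY.open("a") as f:
        f.write(f"- {date.today().isoformat()}: {lesson}\n")
```

The agent's instructions would then include something like "whenever the user corrects you, call `remember` with a one-line summary," and each new session starts by reading `memory.md` back into context.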
The Bottom Line
2025: "How smart is the model?"
2026: "How good is the harness?"

The model is the engine. The harness is the car. Nobody wins a race with just an engine.
The intelligence ceiling keeps rising. The bottleneck is no longer the model — it’s the infrastructure around it. Context management, tool design, verification loops, and session continuity. That’s where the real engineering happens now.