From Prompt Engineering to Harness Engineering: Building Infrastructure for Autonomous Agents
18th March 2026
2025 was the year of agents. 2026 is the year of harnesses — the persistent infrastructure that gives a foundation model hands, feet, and senses. The shift is fundamental: from prompt engineering (optimizing single interactions) to harness engineering (building the systems that control long-running, autonomous agents).
What Is a Harness?
A harness is the software layer wrapping a foundation model. It manages tool access, keeps track of progress, and recovers when the model fails. Standard chat models are “question to answer.” Agents are “goal to result.” The harness is what makes that difference possible.
```
+-------------------------------------------------------+
|                   THE HARNESS LAYER                   |
|                                                       |
|  +-------------+   +-------------+   +-----------+    |
|  |   Context   |   |    Tool     |   |  Memory   |    |
|  | Management  |   |   Access    |   |  System   |    |
|  +------+------+   +------+------+   +-----+-----+    |
|         |                 |                |          |
|  +------v-----------------v----------------v---+      |
|  |             ORCHESTRATION LOOP              |      |
|  |  reason -> act -> observe -> reason -> ...  |      |
|  +----------------------+----------------------+      |
|                         |                             |
|  +----------------------v----------------------+      |
|  |           FOUNDATION MODEL (LLM)            |      |
|  +---------------------------------------------+      |
+-------------------------------------------------------+
```
Intelligence increasingly resides in the scaffolding — the reasoning, memory systems, and tool optimization — rather than the raw power of the LLM.
Context Management: The Hardest Problem
Managing the context window is the most difficult engineering challenge in creating reliable agents. Even models with million-token windows face performance degradation as the window fills up. Performance begins to rot once a window is roughly 40% full, leading to lost signal and poor instruction following.
The Playbook: Reduce, Offload, Isolate
| Strategy | How It Works |
|---|---|
| REDUCE | Prune old tool results, summarize conversation trajectories, keep context lean |
| OFFLOAD | Use the file system or a database as external long-term memory instead of cramming everything into the prompt |
| ISOLATE | Use sub-agents for token-heavy tasks (research, debugging) to keep the orchestrator context clean |
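The “reduce” strategy can be sketched in a few lines. This is an illustrative example, not any particular framework’s API: the message shape and the characters-per-token heuristic are assumptions.

```python
# Sketch of the "reduce" strategy: replace the oldest tool results with
# short stubs once the transcript exceeds a token budget. The message
# format and the ~4-chars-per-token estimate are illustrative assumptions.

def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token.
    return len(text) // 4

def reduce_context(messages: list[dict], budget: int = 50_000) -> list[dict]:
    """Stub out the oldest tool results until the transcript fits the budget."""
    total = sum(estimate_tokens(m["content"]) for m in messages)
    pruned = []
    for msg in messages:
        if total > budget and msg["role"] == "tool":
            stub = {"role": "tool",
                    "content": "[tool result pruned; re-run if needed]"}
            total -= estimate_tokens(msg["content"]) - estimate_tokens(stub["content"])
            pruned.append(stub)
        else:
            pruned.append(msg)
    return pruned
```

The key design choice is that old tool output is the cheapest thing to drop: the agent can always re-run the tool, whereas user instructions cannot be recovered.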
This is why every serious coding agent — Claude Code, OpenCode, Pi — uses sub-agents. It’s not just about parallelism. It’s about protecting the main context window.
The Initializer-Coder Pattern
The industry standard for multi-hour or multi-day tasks. Never ask an agent to build an entire complex application in one shot — that leads to implementation failures and context amnesia.
```
PHASE 1: THE INITIALIZER (runs once)
  |
  |-- Reads the specification
  |-- Creates machine-readable feature list (JSON)
  |-- Every task marked "failed" by default
  |-- Sets up environment (init.sh)
  |
  v
PHASE 2: THE TASK AGENT (iterates)
  |
  |-- Picks one feature at a time
  |-- Implements it
  |-- Verifies it (tests pass?)
  |-- Commits progress
  |-- Updates feature status to "passed"
  |-- Picks next feature
  |-- Repeats until done
```
The Four Artifacts
Continuity across discrete sessions is maintained through four core artifacts:
- features.json — machine-readable task list with pass/fail status
- init.sh — environment initialization script
- progress.md — narrative progress log
- Git history — descriptive commits as a narrative timeline
Bash Is All You Need
A major insight shared by Vercel, Anthropic, and independent builders: models perform better with generic, code-native tools than with bespoke, complex tool schemas.
Instead of building 100 specialized tools, give the agent access to a Bash tool and a file system. The model writes its own scripts to solve problems, expanding its action space dramatically without bloating the system prompt.
| Approach | Tools | Accuracy | Speed |
|---|---|---|---|
| Specialized tools (100+) | Custom schema per task | 80% | Baseline |
| Bash + filesystem | 2 generic tools | 100% | 3.5x faster |
Vercel saw this exact result with a text-to-SQL agent: removing 80% of specialized tools and replacing them with a Bash terminal jumped accuracy from 80% to 100% while running 3.5x faster.
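A “two generic tools” setup can be sketched as follows. The tool-call shape and tool names here are assumptions; wire them to whatever function-calling format your model provider uses:

```python
import subprocess
from pathlib import Path

# Sketch of the two-generic-tools approach: a bash executor and a file
# writer, dispatched by name. The {"name": ..., "args": ...} tool-call
# shape is an illustrative assumption, not a specific provider's API.

def run_bash(command: str, timeout: int = 60) -> str:
    """Run a shell command and return combined stdout/stderr."""
    result = subprocess.run(
        ["bash", "-c", command],
        capture_output=True, text=True, timeout=timeout,
    )
    return result.stdout + result.stderr

def write_file(path: str, content: str) -> str:
    """Write a file and return a short confirmation for the transcript."""
    Path(path).write_text(content)
    return f"wrote {len(content)} bytes to {path}"

TOOLS = {"bash": run_bash, "write_file": write_file}

def dispatch(tool_call: dict) -> str:
    return TOOLS[tool_call["name"]](**tool_call["args"])
```

With only these two entries in the system prompt, the model can still grep, curl, run tests, and compose its own scripts: the action space grows without the schema growing.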
Skills as SOPs for AI
Skills are folders containing scripts and instructions that an agent picks up only when needed. They reduce cognitive load and prevent context pollution — the agent doesn’t carry knowledge about deploying to AWS until it actually needs to deploy.
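Lazy loading is the whole trick: only a one-line description per skill sits in the prompt, and the full instructions enter context on demand. A minimal sketch, assuming a hypothetical `skills/<name>/SKILL.md` layout whose first line is the description:

```python
from pathlib import Path

# Sketch of lazy skill loading. The skills/<name>/SKILL.md layout and
# first-line-as-description convention are assumptions for illustration.

SKILLS_DIR = Path("skills")

def skill_index() -> dict[str, str]:
    """Map skill name -> one-line description (cheap enough for the prompt)."""
    index = {}
    for skill_md in SKILLS_DIR.glob("*/SKILL.md"):
        index[skill_md.parent.name] = skill_md.read_text().splitlines()[0]
    return index

def load_skill(name: str) -> str:
    """Pull the full instructions into context only when the agent needs them."""
    return (SKILLS_DIR / name / "SKILL.md").read_text()
```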
Verification and Reliability
Reliability in agentic systems drops exponentially with steps. A 95% success rate on single steps becomes only 36% over a 20-step task.
```
Step success rate: 95%

 1 step:  95.0%
 5 steps: 77.4%
10 steps: 59.9%
20 steps: 35.8%  <-- this is where most real tasks live
50 steps:  7.7%
```
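These numbers are nothing more than the per-step rate raised to the number of steps:

```python
# End-to-end success over n independent steps is just p ** n.
def end_to_end_success(step_rate: float, steps: int) -> float:
    return step_rate ** steps

for n in (1, 5, 10, 20, 50):
    print(f"{n:2d} steps: {end_to_end_success(0.95, n):.1%}")
```

The assumption of independence is generous: in practice one bad step often poisons the context for every step after it, so real trajectories can decay faster than this.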
The fix is deterministic feedback built into the harness:
- Automated tests — unit tests, linting, type checking after every change
- Eyes — Puppeteer or Chrome DevTools to verify UI changes the model can’t see in code alone
- Human-in-the-loop — strategic checkpoints for high-risk operations (ad budgets, production merges)
- Self-correction — let models read their own error logs and iterate until tests pass
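The self-correction item above reduces to a simple loop around a deterministic oracle. A sketch, where `ask_model_to_fix` is a hypothetical stand-in for your model call and the test command is configurable:

```python
import subprocess

# Sketch of a self-correction loop: run the test suite, and on failure
# hand the error output back to the model for another attempt.
# `ask_model_to_fix` is a hypothetical callback, not a real API.

def verify_and_iterate(ask_model_to_fix,
                       test_cmd=("pytest", "-q"),
                       max_attempts: int = 3) -> bool:
    for _ in range(max_attempts):
        result = subprocess.run(list(test_cmd), capture_output=True, text=True)
        if result.returncode == 0:
            return True  # deterministic signal: tests pass
        ask_model_to_fix(result.stdout + result.stderr)
    return False  # escalate to a human instead of looping forever
```

The exit code, not the model's own opinion of its work, decides whether the loop ends; that is what makes the feedback deterministic.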
Agentic DevOps
A new discipline is emerging that applies DevOps principles to autonomous agents:
| Principle | Applied to Agents |
|---|---|
| Guardrails | Permission scoping, restricted tools |
| Golden paths | CLAUDE.md, agents.md, coding standards |
| Safety nets | Git commits, rollback, test suites |
| Manual review | HITL checkpoints at critical steps |
The Builder’s Checklist
- Start simple. Don’t jump to agents if a structured workflow or a single prompt will suffice.
- Onboard your agent. Treat it like a new employee. Create an agents.md or CLAUDE.md file — the source of truth for roles, business context, and coding standards.
- Implement a memory loop. Tell the agent to update a memory.md file whenever it learns a new preference or corrects a mistake.
- Embrace the bitter lesson. As models improve, remove the crutches. Simpler systems that scale with compute eventually win.
- Use Git for state. Always require the agent to commit with descriptive messages. The Git log is a narrative history future agents can read.
- Leverage MCP. Use the Model Context Protocol to connect your agent to external data sources (Google Drive, Slack, GitHub) in a standardized way.
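The memory loop from the checklist is small enough to show in full. A sketch, assuming a single append-only `memory.md` file:

```python
from datetime import date
from pathlib import Path

# Sketch of the memory loop: append each learned preference or
# correction to memory.md so future sessions can read it back.

MEMORY = Path("memory.md")

def remember(lesson: str) -> None:
    """Append one dated lesson as a markdown bullet."""
    with MEMORY.open("a") as f:
        f.write(f"- {date.today().isoformat()}: {lesson}\n")
```

The agent's instructions would then include something like "whenever the user corrects you, call `remember` with a one-line summary," and each new session starts by reading `memory.md` back into context.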
The Bottom Line
2025: "How smart is the model?"
2026: "How good is the harness?"

The model is the engine. The harness is the car. Nobody wins a race with just an engine.
The intelligence ceiling keeps rising. The bottleneck is no longer the model — it’s the infrastructure around it. Context management, tool design, verification loops, and session continuity. That’s where the real engineering happens now.