Finding the Perfect Prompt: Combining DSPy’s Optimization with Strands Agents for Cost-Effective Multi-Agent Systems
6th March 2026
How we used Bayesian optimization to find better prompts automatically — and made cheap models perform like expensive ones.
Full source code: github.com/avparkhi/dspy-strands-optimizer
Introduction
Every multi-agent system has a dirty secret: the quality of its outputs depends almost entirely on the system prompts wired into each agent. A misworded routing instruction sends queries to the wrong sub-agent. A vague research prompt produces shallow answers. A product recommendation prompt that forgets to mention prices is useless.
Most teams iterate on these prompts by hand — write, test, tweak, repeat. It works, but it’s slow, subjective, and doesn’t scale past a handful of agents.
What if you could automate the prompt search? Define what “good” looks like as a metric, hand over your training data, and let an optimizer find the best prompt for each agent?
That’s exactly what we built. We combined:
- DSPy (Stanford NLP) — a framework that treats prompts as optimizable parameters and uses Bayesian search to find the best ones
- Strands Agents (AWS) — a runtime framework for building multi-agent systems using the agents-as-tools pattern
DSPy finds the prompts offline. Strands runs the agents in production. They’re complementary, not competing.
The Problem
Consider a typical customer service multi-agent setup:
      User Query
           │
           ▼
    ┌──────────────┐
    │ Orchestrator │ ──→ Which agent should handle this?
    └──────┬───────┘
           │
   ┌───────┼───────┐
   ▼       ▼       ▼
Research Product  Trip
 Agent    Agent   Agent
Each box needs a system prompt. The orchestrator needs routing logic. Each sub-agent needs domain expertise. That’s 4 prompts to get right, and they interact — a great research prompt is useless if the orchestrator never routes research queries to it.
The traditional approach:
Human writes prompt → tests on 5 examples → tweaks wording → tests again → ships it → hopes for the best
The DSPy approach:
DSPy generates 100 prompt candidates → scores all of them → generates better ones → finds the winner
How DSPy Finds the Right Prompt
We cloned the DSPy repository and read the actual optimization code. Here’s what happens under the hood.
Strategy 1: BootstrapFewShot — Find the Right Examples
Source: dspy/teleprompt/bootstrap.py
The simplest optimizer. It doesn’t change the instruction text — it finds the best few-shot examples to stuff into the prompt.
class BootstrapFewShot(Teleprompter):
    def compile(self, student, *, teacher=None, trainset):
        self._prepare_student_and_teacher(student, teacher)
        self._bootstrap()             # Run teacher on examples, keep good traces
        self.student = self._train()  # Insert best demos into student
        return self.student
How it works:
- Takes your program (the “student”) and a “teacher” model
- Runs the teacher on each training example, capturing the full execution trace — every input and output at every step
- Checks if the output passes your metric (accuracy, keyword match, etc.)
- If it passes, that trace becomes a bootstrapped demo — a worked example showing the model how to reason
- Stuffs the best demos into the student’s prompt
The key insight: it uses the LM itself to generate correct worked examples, then feeds those back as few-shot demonstrations.
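In plain Python, the core bootstrap loop amounts to something like this (a minimal sketch of the idea, not DSPy's actual implementation; teacher and metric stand in for the teacher program and your metric function):

```python
def bootstrap_demos(teacher, metric, trainset, max_demos=3):
    """Collect worked examples: run the teacher on each training example
    and keep the traces that pass the metric as few-shot demos.
    (Illustrative sketch, not DSPy's implementation.)"""
    demos = []
    for example in trainset:
        prediction = teacher(example)        # full execution trace
        if metric(example, prediction):      # keep only passing traces
            demos.append({"example": example, "prediction": prediction})
        if len(demos) >= max_demos:
            break
    return demos
```

The real optimizer captures every intermediate input/output in the trace, not just the final prediction, but the filtering logic is the same.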
Strategy 2: COPRO — Let the LM Rewrite Its Own Instructions
Source: dspy/teleprompt/copro_optimizer.py
This one actually rewrites the instruction text. It’s hill-climbing where the LM is the search operator.
class BasicGenerateInstruction(Signature):
    """You are an instruction optimizer for large language models.
    I will give you a signature of fields (inputs and outputs) in English.
    Your task is to propose an instruction that will lead a good language
    model to perform the task well. Don't be afraid to be creative."""

    basic_instruction = dspy.InputField()
    proposed_instruction = dspy.OutputField()
The loop:
- Start with your initial instruction (e.g., “Answer the question”)
- Ask the LM to propose breadth=10 new instructions
- Evaluate each by running the full program on training data and scoring with your metric
- Feed the top scorers + their scores back:
class GenerateInstructionGivenAttempts(dspy.Signature):
    """I will give some task instructions I've tried, along with their
    validation scores. The instructions are arranged in increasing order
    based on their scores. Propose a new instruction that will perform
    even better."""

    attempted_instructions = dspy.InputField()
    proposed_instruction = dspy.OutputField()
- Repeat for depth=3 iterations
- Return the best-scoring program
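Condensed to plain Python, the loop looks roughly like this (an illustrative sketch, not DSPy's code; propose stands in for the LM-backed instruction generator and evaluate for a metric run over the training set):

```python
def copro_search(propose, evaluate, initial_instruction, breadth=10, depth=3):
    """Hill-climbing sketch of the COPRO loop: the LM is the search
    operator. (Illustrative, not DSPy's implementation.)"""
    attempts = [(initial_instruction, evaluate(initial_instruction))]
    for _ in range(depth):
        # Feed prior attempts back in increasing score order, as COPRO does
        attempts.sort(key=lambda pair: pair[1])
        for candidate in propose(attempts, breadth):
            attempts.append((candidate, evaluate(candidate)))
    return max(attempts, key=lambda pair: pair[1])[0]
```

Each evaluate call runs the whole program over the training set, which is why COPRO's cost grows with breadth × depth × dataset size.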
Strategy 3: MIPROv2 — Bayesian Optimization Over Everything
Source: dspy/teleprompt/mipro_optimizer_v2.py
This is what we used. It combines instruction optimization + few-shot selection using Optuna’s Bayesian optimization.
Step 1: Bootstrap Few-Shot Demo Candidates
Same as BootstrapFewShot, but generates N independent sets of demos (we used N=6). Each set is created with a different random seed, giving diverse examples.
Step 2: Propose Instruction Candidates via GroundedProposer
This is the most interesting part. The GroundedProposer (dspy/propose/grounded_proposer.py) doesn’t just ask “write a better instruction.” It gives the LM rich context about the task:
class GroundedProposer(Proposer):
    def __init__(self, prompt_model, program, trainset, ...):
        # 1. Summarize the dataset
        self.data_summary = create_dataset_summary(trainset)
        # 2. Read the program's actual Python source code
        self.program_code_string = get_dspy_source_code(program)
When generating an instruction, it:
- Summarizes your dataset — “This dataset contains user queries about travel, products, and research topics”
- Reads your program’s source code — literally inspects the Python class and describes what it does
- Describes each module’s role — “This predictor classifies queries into categories”
- Picks a random tip from a curated set:
TIPS = {
    "none": "",
    "creative": "Don't be afraid to be creative when creating the new instruction!",
    "simple": "Keep the instruction clear and concise.",
    "description": "Make sure your instruction is very informative and descriptive.",
    "high_stakes": "The instruction should include a high stakes scenario!",
    "persona": 'Include a persona that is relevant to the task (ie. "You are a ...")',
}
- Feeds everything to the LM with a unique rollout_id and temperature to bypass cache and get diverse candidates
Here’s what a generated instruction looked like for our router:
“Analyze the user’s query and classify it into exactly one of three specialist agent categories: research (factual, informational, or knowledge-based questions), product (shopping, purchasing, or product recommendation requests — often containing price ranges, brand names, or specifications), or trip (travel planning, itinerary creation, destination guidance). Carefully examine the query’s intent, key signals (such as budget constraints, travel dates, or factual subject matter), and domain to determine the most appropriate routing.”
That’s way more detailed than our original “Classify a user query and route it to the correct specialist agent” — and DSPy generated it automatically.
Step 3: Bayesian Search with Optuna
Now we have N instruction candidates and N demo sets per predictor. The search space is:
For each predictor:
    instruction ∈ {candidate_0, candidate_1, ..., candidate_N}
    demo_set    ∈ {demo_set_0, demo_set_1, ..., demo_set_N}
MIPROv2 uses Optuna’s Tree-structured Parzen Estimator (TPE) to search this space efficiently:
sampler = optuna.samplers.TPESampler(seed=seed, multivariate=True)
study = optuna.create_study(direction="maximize", sampler=sampler)

def objective(trial):
    # Pick an instruction and demo set for each predictor
    for i, predictor in enumerate(program.predictors()):
        instruction_idx = trial.suggest_categorical(
            f"{i}_predictor_instruction", range(len(instruction_candidates[i]))
        )
        demos_idx = trial.suggest_categorical(
            f"{i}_predictor_demos", range(len(demo_candidates[i]))
        )
    # (Simplified) assemble candidate_program from the chosen
    # instruction/demo indices, then evaluate on the validation set
    score = evaluate(candidate_program)
    return score

study.optimize(objective, n_trials=num_trials)
Unlike random search, TPE learns which combinations score well and focuses future trials on promising regions. It also uses minibatching: evaluating on small subsets first, then running the full evaluation only on the candidates with the best average minibatch scores.
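To see why a learned sampler beats exhaustive search, count the grid (illustrative arithmetic; assumes 6 candidates per axis and, hypothetically, 4 jointly optimized predictors):

```python
# Illustrative count: with 6 instruction candidates and 6 demo sets per
# predictor, each predictor has 36 (instruction, demo_set) combinations.
# A hypothetical 4-predictor program optimized jointly would have 36**4
# possible configurations, far beyond what 10 trials could enumerate.
n_instructions = 6
n_demo_sets = 6
n_predictors = 4

per_predictor = n_instructions * n_demo_sets   # 36
full_grid = per_predictor ** n_predictors      # 1,679,616

print(per_predictor, full_grid)  # 36 1679616
```

TPE's job is to find a near-best cell in that grid while evaluating only a handful of them.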
The Complete Flow
Training Data
      │
      ▼
┌─────────────┐     ┌──────────────────┐
│  Bootstrap  │────▶│ N demo candidate │
│  Few-Shot   │     │       sets       │
└─────────────┘     └────────┬─────────┘
                             │
┌─────────────┐     ┌────────▼─────────┐
│  Grounded   │────▶│  N instruction   │
│  Proposer   │     │    candidates    │
└─────────────┘     └────────┬─────────┘
                             │
                    ┌────────▼─────────┐
  Optuna TPE        │     Bayesian     │──▶ Best (instruction, demos)
  - random tips     │      Search      │    combo per predictor
  - past attempts   └──────────────────┘
Building the Multi-Agent System
The DSPy Side: Defining What to Optimize
We defined 4 DSPy modules — one router and three specialists:
class RouteQuery(dspy.Signature):
    """Classify a user query and route it to the correct specialist agent."""

    query = dspy.InputField(desc="The user's question or request")
    agent = dspy.OutputField(desc="One of: research, product, trip")


class ResearchAnswer(dspy.Signature):
    """Answer a factual research question accurately and concisely."""

    query = dspy.InputField()
    answer = dspy.OutputField()
And metrics that define “good”:
def routing_metric(example, pred, trace=None):
    """Did it pick the right agent?"""
    return pred.agent.strip().lower() == example.expected_agent.strip().lower()


def answer_quality_metric(example, pred, trace=None):
    """Does the answer contain expected keywords?"""
    expected_words = set(example.answer.lower().split())
    matched = sum(1 for w in expected_words if w in pred.answer.lower())
    return matched / max(len(expected_words), 1) >= 0.4
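The keyword-overlap metric is easy to sanity-check in isolation (the metric is repeated here so the snippet is self-contained; SimpleNamespace stands in for DSPy's example and prediction objects):

```python
from types import SimpleNamespace

def answer_quality_metric(example, pred, trace=None):
    """Does the answer contain expected keywords?"""
    expected_words = set(example.answer.lower().split())
    matched = sum(1 for w in expected_words if w in pred.answer.lower())
    return matched / max(len(expected_words), 1) >= 0.4

example = SimpleNamespace(answer="Photosynthesis converts light into glucose")
good = SimpleNamespace(
    answer="Photosynthesis converts light energy into chemical energy (glucose)."
)
bad = SimpleNamespace(answer="Plants are green.")

print(answer_quality_metric(example, good))  # True  (all 5 keywords appear)
print(answer_quality_metric(example, bad))   # False (0 of 5 keywords appear)
```

Note the check is a substring match against the whole answer, so morphological variants like "glucose)." still count; a stricter metric would tokenize the prediction first.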
Then we let MIPROv2 loose:
optimizer = dspy.MIPROv2(
    metric=routing_metric,
    auto="light",             # 6 candidates, 10 trials
    max_bootstrapped_demos=3,
    max_labeled_demos=3,
)
optimized = optimizer.compile(RouterModule(), trainset=ROUTING_EXAMPLES)
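After compile, the winning instruction and demos live on the program's predictors. A sketch of how one might dump them to JSON for the production runtime (an assumption-laden sketch: it relies on predictors exposing signature.instructions and demos, as DSPy predictors do):

```python
import json

def export_prompts(program, path="optimized_prompts.json"):
    """Serialize each predictor's optimized instruction and few-shot
    demos so the production runtime can load them without DSPy.
    (Sketch; assumes predictors expose .signature.instructions and
    .demos, as DSPy predictors do.)"""
    payload = {}
    for i, predictor in enumerate(program.predictors()):
        payload[f"predictor_{i}"] = {
            "instruction": predictor.signature.instructions,
            "demos": [dict(demo) for demo in predictor.demos],
        }
    with open(path, "w") as f:
        json.dump(payload, f, indent=2)
    return payload
```

Once the prompts are plain JSON, the runtime has no dependency on DSPy at all.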
The Strands Side: Deploying Optimized Prompts
After optimization, we extract the winning prompts and build Strands agents:
import json

from strands import Agent, tool
from strands.models import BedrockModel

# Load DSPy-optimized prompts
prompts = json.load(open("optimized_prompts.json"))

@tool
def research_assistant(query: str) -> str:
    """Handle factual research questions about science, history, how things work."""
    agent = Agent(
        system_prompt=build_system_prompt(prompts["research"]),
        model=BedrockModel(model_id="us.anthropic.claude-sonnet-4-6"),
        callback_handler=None,
    )
    return str(agent(query))

# Orchestrator uses all sub-agents as tools
orchestrator = Agent(
    system_prompt=build_system_prompt(prompts["router"]),
    model=BedrockModel(model_id="us.anthropic.claude-sonnet-4-6"),
    tools=[research_assistant, product_assistant, trip_assistant],
)
The build_system_prompt function converts DSPy’s structured output (instruction + demos) into a Strands-compatible system prompt string, including the few-shot examples that DSPy found.
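As a concrete illustration, build_system_prompt could be as simple as the following (a sketch; the exact field layout of optimized_prompts.json is an assumption, here an instruction string plus a list of input/output demo dicts):

```python
def build_system_prompt(entry):
    """Flatten a DSPy-optimized (instruction + demos) entry into a single
    system prompt string for a Strands agent. (Sketch; the field names
    in optimized_prompts.json are an assumption.)"""
    lines = [entry["instruction"], "", "Examples:"]
    for demo in entry.get("demos", []):
        for field, value in demo.items():
            lines.append(f"{field.capitalize()}: {value}")
        lines.append("")  # blank line between demos
    return "\n".join(lines).strip()
```

Keeping this conversion in one function means the JSON schema can evolve without touching the agent definitions.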
Results
Optimization Performance
Optimization ran on Claude Sonnet 4.6 via AWS Bedrock (us.anthropic.claude-sonnet-4-6); the optimized prompts were then evaluated on both Sonnet 4.6 and Haiku 4.5.
| Agent | Optimization Method | Sonnet 4.6 Score | Haiku 4.5 Score | Notes |
|---|---|---|---|---|
| Router | MIPROv2 (10 trials) | 100% | 100% | All 10 trials scored 100% |
| Research | BootstrapFewShot | 100% | 100% | 3 bootstrapped demos |
| Product | BootstrapFewShot | 100% | 50% | Only 2 training examples |
| Trip | BootstrapFewShot | 100% | 100% | 2 bootstrapped demos |
- Optimization time: ~13 minutes for all 4 agents
- LLM calls: ~3,000 for the router alone (10 trials × 12 eval examples × multiple candidates)
- Cost: Approximately $5-10 in Bedrock API costs
What MIPROv2 Generated
For the router, it proposed 3 instruction candidates. The most detailed one:
“Analyze the user’s query and identify its core intent: Is it seeking factual knowledge or information (-> research), looking to purchase or find a specific product with potential constraints like price or features (-> product), or planning travel, a trip itinerary, or destination experiences (-> trip)? Pay close attention to keywords, numerical parameters (prices, durations, dates), and the underlying goal of the request.”
For the research agent, it bootstrapped 3 demos including detailed chain-of-thought reasoning:
{
  "query": "How does photosynthesis work?",
  "reasoning": "Photosynthesis is the process by which plants convert
    light energy into chemical energy stored as glucose. It occurs
    primarily in the chloroplasts... The process has two main stages:
    the light-dependent reactions and the Calvin cycle...",
  "answer": "Photosynthesis converts light energy into chemical energy
    (glucose) in two stages..."
}
These bootstrapped demos teach the production model how to reason, not just what to answer.
Cost Optimization
The most exciting finding: 3 out of 4 agents worked perfectly on Haiku 4.5, which is roughly 10x cheaper than Sonnet 4.6. The optimized prompts (with detailed instructions and few-shot demos) gave the cheaper model enough guidance to match Sonnet’s quality.
The product agent scored only 50% on Haiku — but that’s because we only had 2 training examples. More data would likely fix this.
Production Test
We ran 3 queries through the full Strands multi-agent system:
- “What is quantum entanglement?” → Correctly routed to research_assistant → Detailed, accurate response with applications table
- “Recommend a good laptop for video editing” → Correctly routed to product_assistant → 5 specific models with prices across budget tiers
- “Plan a weekend trip to Napa Valley” → Correctly routed to trip_assistant → Full itinerary with times, restaurants, costs, and tips
All three queries were routed correctly and produced high-quality, detailed responses.
Scalability: An Honest Assessment
What Works
| Dimension | Assessment |
|---|---|
| Runtime latency | 2-5s per query (orchestrator + sub-agent = 2 LLM calls). Fine. |
| Runtime throughput | Bounded by Bedrock rate limits, not code. Scales horizontally. |
| Prompt quality | Excellent. DSPy consistently finds better prompts than hand-tuning. |
| 5-20 agents | Sweet spot. Manageable training data, reasonable optimization time. |
What Breaks
Optimization cost grows with agents:
| Agents | Est. LLM Calls | Est. Time | Est. Cost |
|---|---|---|---|
| 4 (our demo) | ~3,000 | 13 min | ~$5 |
| 10 | ~8,000 | 35 min | ~$15 |
| 30 | ~25,000 | 2 hrs | ~$50 |
| 100+ | ~100,000+ | 8+ hrs | ~$200+ |
No incremental updates. Add one agent? Re-optimize the router (since it now has a new routing target) plus the new agent. Change training data? Re-optimize everything.
Training data curation is manual. For 100 agents you need routing examples covering all types, plus quality examples per agent. This is a human bottleneck, not a compute one.
No runtime adaptation. Optimized prompts are frozen. If production traffic reveals new patterns, there’s no feedback loop — you manually add examples and re-run optimization.
Sub-agent instantiation cost. Our current code creates Agent() per tool call. At high QPS, this means connection churn to Bedrock. Fixable with pooling, but not implemented.
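One fix is a small pool that builds each sub-agent once and reuses it across tool calls (a sketch; factory would wrap the Agent(...) construction shown earlier, and a production version would need a lock around construction for concurrent traffic):

```python
class AgentPool:
    """Cache one pre-built agent per name so repeated tool calls reuse
    the same instance instead of re-instantiating (and re-connecting
    to Bedrock) on every query. Sketch, not production code."""

    def __init__(self, factory):
        self._factory = factory  # e.g. lambda name: Agent(system_prompt=...)
        self._agents = {}

    def get(self, name):
        if name not in self._agents:
            self._agents[name] = self._factory(name)
        return self._agents[name]
```

Inside the @tool functions, agent = pool.get("research") would then replace the per-call Agent(...) construction.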
The Scalable Version Would Need
┌───────────────────────────────────────────────────┐
│               SCALABLE ARCHITECTURE               │
│                                                   │
│ 1. Agent Registry (config files, not code)        │
│ 2. Prompt Cache in DB/S3 (versioned, A/B-able)    │
│ 3. Incremental Optimization (only changed agents) │
│ 4. Agent Pool (pre-warmed, reusable instances)    │
│ 5. Feedback Loop (prod logs → training data)      │
└───────────────────────────────────────────────────┘
When to Use This Approach
Good Fit
- You have fewer than ~20 agents with clear domains
- You can define a measurable quality metric per agent
- You can curate 10-50 training examples per agent
- You want to reduce costs by using optimized prompts on cheaper models
- You’re running on AWS Bedrock and want the Strands ecosystem
- Prompt quality matters more than iteration speed
Bad Fit
- You need dynamic agent creation at runtime (agents determined by input)
- You have 100+ agents — optimization cost and data curation won’t scale
- You need real-time adaptation — no feedback loop in this architecture
- Your agent topology is a deep chain (A→B→C→D) — DSPy optimizes modules independently, missing cross-module interactions
- You’re iterating rapidly on agent definitions (weekly changes = weekly re-optimization)
Alternatives to Consider
- LangGraph — better for dynamic agent graphs and complex orchestration
- CrewAI — better for agent-of-agents with role-based collaboration
- Anthropic Agent SDK — native support for handoffs and sub-agent spawning
- Manual prompt engineering — sometimes good enough
Conclusion
DSPy and Strands Agents serve different purposes that combine well:
- DSPy is an offline prompt R&D lab. It systematically searches through instruction × few-shot combinations using Bayesian optimization, finding prompts that humans wouldn’t think to write.
- Strands is a production agent runtime. It handles tool routing, sub-agent orchestration, and Bedrock integration.
The workflow is simple: optimize offline, deploy at runtime. The optimized prompts live in a JSON file — no framework lock-in, no magic. You could take those prompts and use them in any framework.
The most practical win? Cost reduction. Optimized prompts on Haiku 4.5 matched naive prompts on Sonnet 4.6 for 3 out of 4 agents in our test. That’s a 10x cost savings for the same quality — and all it took was 13 minutes of automated optimization.
The approach has real scalability limits beyond ~20 agents, and there’s no runtime learning loop. But for teams building focused multi-agent systems who want better prompts without the manual grind, this combination delivers.
Code and full implementation: see the README and the project files in this repository.