Finding the Perfect Prompt: Combining DSPy’s Optimization with Strands Agents for Cost-Effective Multi-Agent Systems
6th March 2026
How we used Bayesian optimization to find better prompts automatically — and made cheap models perform like expensive ones.
Full source code: github.com/avparkhi/dspy-strands-optimizer
Introduction
Every multi-agent system has a dirty secret: the quality of its outputs depends almost entirely on the system prompts wired into each agent. A misworded routing instruction sends queries to the wrong sub-agent. A vague research prompt produces shallow answers. A product recommendation prompt that forgets to mention prices is useless.
Most teams iterate on these prompts by hand — write, test, tweak, repeat. It works, but it’s slow, subjective, and doesn’t scale past a handful of agents.
What if you could automate the prompt search? Define what “good” looks like as a metric, hand over your training data, and let an optimizer find the best prompt for each agent?
That’s exactly what we built. We combined:
- DSPy (Stanford NLP) — a framework that treats prompts as optimizable parameters and uses Bayesian search to find the best ones
- Strands Agents (AWS) — a runtime framework for building multi-agent systems using the agents-as-tools pattern
DSPy finds the prompts offline. Strands runs the agents in production. They’re complementary, not competing.
The Problem
Consider a typical customer service multi-agent setup:
      User Query
           │
           ▼
    ┌──────────────┐
    │ Orchestrator │ ──→ Which agent should handle this?
    └──────┬───────┘
           │
   ┌───────┼───────┐
   ▼       ▼       ▼
Research Product  Trip
 Agent    Agent   Agent
Each box needs a system prompt. The orchestrator needs routing logic. Each sub-agent needs domain expertise. That’s 4 prompts to get right, and they interact — a great research prompt is useless if the orchestrator never routes research queries to it.
The traditional approach:
Human writes prompt → tests on 5 examples → tweaks wording → tests again → ships it → hopes for the best
The DSPy approach:
DSPy generates 100 prompt candidates → scores all of them → generates better ones → finds the winner
How DSPy Finds the Right Prompt
We cloned the DSPy repository and read the actual optimization code. Here’s what happens under the hood.
Strategy 1: BootstrapFewShot — Find the Right Examples
Source: dspy/teleprompt/bootstrap.py
The simplest optimizer. It doesn’t change the instruction text — it finds the best few-shot examples to stuff into the prompt.
class BootstrapFewShot(Teleprompter):
    def compile(self, student, *, teacher=None, trainset):
        self._prepare_student_and_teacher(student, teacher)
        self._bootstrap()             # Run teacher on examples, keep good traces
        self.student = self._train()  # Insert best demos into student
        return self.student
How it works:
- Takes your program (the “student”) and a “teacher” model
- Runs the teacher on each training example, capturing the full execution trace — every input and output at every step
- Checks if the output passes your metric (accuracy, keyword match, etc.)
- If it passes, that trace becomes a bootstrapped demo — a worked example showing the model how to reason
- Stuffs the best demos into the student’s prompt
The key insight: it uses the LM itself to generate correct worked examples, then feeds those back as few-shot demonstrations.
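In plain Python, the core bootstrap loop amounts to something like this (a minimal sketch of the idea, not DSPy's actual implementation; teacher and metric stand in for the teacher program and your metric function):

```python
def bootstrap_demos(teacher, metric, trainset, max_demos=3):
    """Collect worked examples: run the teacher on each training example
    and keep the traces that pass the metric as few-shot demos.
    (Illustrative sketch, not DSPy's implementation.)"""
    demos = []
    for example in trainset:
        prediction = teacher(example)        # full execution trace
        if metric(example, prediction):      # keep only passing traces
            demos.append({"example": example, "prediction": prediction})
        if len(demos) >= max_demos:
            break
    return demos
```

The real optimizer captures every intermediate input/output in the trace, not just the final prediction, but the filtering logic is the same.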
Strategy 2: COPRO — Let the LM Rewrite Its Own Instructions
Source: dspy/teleprompt/copro_optimizer.py
This one actually rewrites the instruction text. It’s hill-climbing where the LM is the search operator.
class BasicGenerateInstruction(Signature):
    """You are an instruction optimizer for large language models.
    I will give you a signature of fields (inputs and outputs) in English.
    Your task is to propose an instruction that will lead a good language
    model to perform the task well. Don't be afraid to be creative."""

    basic_instruction = dspy.InputField()
    proposed_instruction = dspy.OutputField()
The loop:
- Start with your initial instruction (e.g., “Answer the question”)
- Ask the LM to propose breadth=10 new instructions
- Evaluate each by running the full program on training data and scoring with your metric
- Feed the top scorers + their scores back:
class GenerateInstructionGivenAttempts(dspy.Signature):
    """I will give some task instructions I've tried, along with their
    validation scores. The instructions are arranged in increasing order
    based on their scores. Propose a new instruction that will perform
    even better."""

    attempted_instructions = dspy.InputField()
    proposed_instruction = dspy.OutputField()
- Repeat for depth=3 iterations
- Return the best-scoring program
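Condensed to plain Python, the loop looks roughly like this (an illustrative sketch, not DSPy's code; propose stands in for the LM-backed instruction generator and evaluate for a metric run over the training set):

```python
def copro_search(propose, evaluate, initial_instruction, breadth=10, depth=3):
    """Hill-climbing sketch of the COPRO loop: the LM is the search
    operator. (Illustrative, not DSPy's implementation.)"""
    attempts = [(initial_instruction, evaluate(initial_instruction))]
    for _ in range(depth):
        # Feed prior attempts back in increasing score order, as COPRO does
        attempts.sort(key=lambda pair: pair[1])
        for candidate in propose(attempts, breadth):
            attempts.append((candidate, evaluate(candidate)))
    return max(attempts, key=lambda pair: pair[1])[0]
```

Each evaluate call runs the whole program over the training set, which is why COPRO's cost grows with breadth × depth × dataset size.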
Strategy 3: MIPROv2 — Bayesian Optimization Over Everything
Source: dspy/teleprompt/mipro_optimizer_v2.py
This is what we used. It combines instruction optimization + few-shot selection using Optuna’s Bayesian optimization.
Step 1: Bootstrap Few-Shot Demo Candidates
Same as BootstrapFewShot, but generates N independent sets of demos (we used N=6). Each set is created with a different random seed, giving diverse examples.
Step 2: Propose Instruction Candidates via GroundedProposer
This is the most interesting part. The GroundedProposer (dspy/propose/grounded_proposer.py) doesn’t just ask “write a better instruction.” It gives the LM rich context about the task:
class GroundedProposer(Proposer):
    def __init__(self, prompt_model, program, trainset, ...):
        # 1. Summarize the dataset
        self.data_summary = create_dataset_summary(trainset)
        # 2. Read the program's actual Python source code
        self.program_code_string = get_dspy_source_code(program)
When generating an instruction, it:
- Summarizes your dataset — “This dataset contains user queries about travel, products, and research topics”
- Reads your program’s source code — literally inspects the Python class and describes what it does
- Describes each module’s role — “This predictor classifies queries into categories”
- Picks a random tip from a curated set:
TIPS = {
    "none": "",
    "creative": "Don't be afraid to be creative when creating the new instruction!",
    "simple": "Keep the instruction clear and concise.",
    "description": "Make sure your instruction is very informative and descriptive.",
    "high_stakes": "The instruction should include a high stakes scenario!",
    "persona": 'Include a persona that is relevant to the task (ie. "You are a ...")',
}
- Feeds everything to the LM with a unique rollout_id and temperature to bypass cache and get diverse candidates
Here’s what a generated instruction looked like for our router:
“Analyze the user’s query and classify it into exactly one of three specialist agent categories: research (factual, informational, or knowledge-based questions), product (shopping, purchasing, or product recommendation requests — often containing price ranges, brand names, or specifications), or trip (travel planning, itinerary creation, destination guidance). Carefully examine the query’s intent, key signals (such as budget constraints, travel dates, or factual subject matter), and domain to determine the most appropriate routing.”
That’s way more detailed than our original “Classify a user query and route it to the correct specialist agent” — and DSPy generated it automatically.
Step 3: Bayesian Search with Optuna
Now we have N instruction candidates and N demo sets per predictor. The search space is:
For each predictor:
    instruction ∈ {candidate_0, candidate_1, ..., candidate_N}
    demo_set    ∈ {demo_set_0, demo_set_1, ..., demo_set_N}
MIPROv2 uses Optuna’s Tree-structured Parzen Estimator (TPE) to search this space efficiently:
sampler = optuna.samplers.TPESampler(seed=seed, multivariate=True)
study = optuna.create_study(direction="maximize", sampler=sampler)

def objective(trial):
    # Pick an instruction and demo set for each predictor
    for i, predictor in enumerate(program.predictors()):
        instruction_idx = trial.suggest_categorical(
            f"{i}_predictor_instruction", range(len(instruction_candidates[i]))
        )
        demos_idx = trial.suggest_categorical(
            f"{i}_predictor_demos", range(len(demo_candidates[i]))
        )
    # (Simplified) assemble candidate_program from the chosen
    # instruction/demo indices, then evaluate on the validation set
    score = evaluate(candidate_program)
    return score

study.optimize(objective, n_trials=num_trials)
Unlike random search, TPE learns which combinations score well and focuses future trials on promising regions. It also uses minibatching: evaluating on small subsets first, then running the full evaluation only on the candidates with the best average minibatch scores.
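To see why a learned sampler beats exhaustive search, count the grid (illustrative arithmetic; assumes 6 candidates per axis and, hypothetically, 4 jointly optimized predictors):

```python
# Illustrative count: with 6 instruction candidates and 6 demo sets per
# predictor, each predictor has 36 (instruction, demo_set) combinations.
# A hypothetical 4-predictor program optimized jointly would have 36**4
# possible configurations, far beyond what 10 trials could enumerate.
n_instructions = 6
n_demo_sets = 6
n_predictors = 4

per_predictor = n_instructions * n_demo_sets   # 36
full_grid = per_predictor ** n_predictors      # 1,679,616

print(per_predictor, full_grid)  # 36 1679616
```

TPE's job is to find a near-best cell in that grid while evaluating only a handful of them.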
The Complete Flow
Training Data
      │
      ▼
┌─────────────┐     ┌──────────────────┐
│  Bootstrap  │────▶│ N demo candidate │
│  Few-Shot   │     │       sets       │
└─────────────┘     └────────┬─────────┘
                             │
┌─────────────┐     ┌────────▼─────────┐
│  Grounded   │────▶│  N instruction   │
│  Proposer   │     │    candidates    │
└─────────────┘     └────────┬─────────┘
                             │
                    ┌────────▼─────────┐
  Optuna TPE        │     Bayesian     │──▶ Best (instruction, demos)
  - random tips     │      Search      │    combo per predictor
  - past attempts   └──────────────────┘
Building the Multi-Agent System
The DSPy Side: Defining What to Optimize
We defined 4 DSPy modules — one router and three specialists:
class RouteQuery(dspy.Signature):
    """Classify a user query and route it to the correct specialist agent."""

    query = dspy.InputField(desc="The user's question or request")
    agent = dspy.OutputField(desc="One of: research, product, trip")


class ResearchAnswer(dspy.Signature):
    """Answer a factual research question accurately and concisely."""

    query = dspy.InputField()
    answer = dspy.OutputField()
And metrics that define “good”:
def routing_metric(example, pred, trace=None):
    """Did it pick the right agent?"""
    return pred.agent.strip().lower() == example.expected_agent.strip().lower()


def answer_quality_metric(example, pred, trace=None):
    """Does the answer contain expected keywords?"""
    expected_words = set(example.answer.lower().split())
    matched = sum(1 for w in expected_words if w in pred.answer.lower())
    return matched / max(len(expected_words), 1) >= 0.4
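The keyword-overlap metric is easy to sanity-check in isolation (the metric is repeated here so the snippet is self-contained; SimpleNamespace stands in for DSPy's example and prediction objects):

```python
from types import SimpleNamespace

def answer_quality_metric(example, pred, trace=None):
    """Does the answer contain expected keywords?"""
    expected_words = set(example.answer.lower().split())
    matched = sum(1 for w in expected_words if w in pred.answer.lower())
    return matched / max(len(expected_words), 1) >= 0.4

example = SimpleNamespace(answer="Photosynthesis converts light into glucose")
good = SimpleNamespace(
    answer="Photosynthesis converts light energy into chemical energy (glucose)."
)
bad = SimpleNamespace(answer="Plants are green.")

print(answer_quality_metric(example, good))  # True  (all 5 keywords appear)
print(answer_quality_metric(example, bad))   # False (0 of 5 keywords appear)
```

Note the check is a substring match against the whole answer, so morphological variants like "glucose)." still count; a stricter metric would tokenize the prediction first.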
Then we let MIPROv2 loose:
optimizer = dspy.MIPROv2(
    metric=routing_metric,
    auto="light",             # 6 candidates, 10 trials
    max_bootstrapped_demos=3,
    max_labeled_demos=3,
)
optimized = optimizer.compile(RouterModule(), trainset=ROUTING_EXAMPLES)
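After compile, the winning instruction and demos live on the program's predictors. A sketch of how one might dump them to JSON for the production runtime (an assumption-laden sketch: it relies on predictors exposing signature.instructions and demos, as DSPy predictors do):

```python
import json

def export_prompts(program, path="optimized_prompts.json"):
    """Serialize each predictor's optimized instruction and few-shot
    demos so the production runtime can load them without DSPy.
    (Sketch; assumes predictors expose .signature.instructions and
    .demos, as DSPy predictors do.)"""
    payload = {}
    for i, predictor in enumerate(program.predictors()):
        payload[f"predictor_{i}"] = {
            "instruction": predictor.signature.instructions,
            "demos": [dict(demo) for demo in predictor.demos],
        }
    with open(path, "w") as f:
        json.dump(payload, f, indent=2)
    return payload
```

Once the prompts are plain JSON, the runtime has no dependency on DSPy at all.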
The Strands Side: Deploying Optimized Prompts
After optimization, we extract the winning prompts and build Strands agents:
import json

from strands import Agent, tool
from strands.models import BedrockModel

# Load DSPy-optimized prompts
prompts = json.load(open("optimized_prompts.json"))

@tool
def research_assistant(query: str) -> str:
    """Handle factual research questions about science, history, how things work."""
    agent = Agent(
        system_prompt=build_system_prompt(prompts["research"]),
        model=BedrockModel(model_id="us.anthropic.claude-sonnet-4-6"),
        callback_handler=None,
    )
    return str(agent(query))

# Orchestrator uses all sub-agents as tools
orchestrator = Agent(
    system_prompt=build_system_prompt(prompts["router"]),
    model=BedrockModel(model_id="us.anthropic.claude-sonnet-4-6"),
    tools=[research_assistant, product_assistant, trip_assistant],
)
The build_system_prompt function converts DSPy’s structured output (instruction + demos) into a Strands-compatible system prompt string, including the few-shot examples that DSPy found.
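As a concrete illustration, build_system_prompt could be as simple as the following (a sketch; the exact field layout of optimized_prompts.json is an assumption, here an instruction string plus a list of input/output demo dicts):

```python
def build_system_prompt(entry):
    """Flatten a DSPy-optimized (instruction + demos) entry into a single
    system prompt string for a Strands agent. (Sketch; the field names
    in optimized_prompts.json are an assumption.)"""
    lines = [entry["instruction"], "", "Examples:"]
    for demo in entry.get("demos", []):
        for field, value in demo.items():
            lines.append(f"{field.capitalize()}: {value}")
        lines.append("")  # blank line between demos
    return "\n".join(lines).strip()
```

Keeping this conversion in one function means the JSON schema can evolve without touching the agent definitions.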
Results
Optimization Performance
Optimization ran on Claude Sonnet 4.6 via AWS Bedrock (us.anthropic.claude-sonnet-4-6); the optimized prompts were then evaluated on both Sonnet 4.6 and Haiku 4.5.
| Agent | Optimization Method | Sonnet 4.6 Score | Haiku 4.5 Score | Notes |
|---|---|---|---|---|
| Router | MIPROv2 (10 trials) | 100% | 100% | All 10 trials scored 100% |
| Research | BootstrapFewShot | 100% | 100% | 3 bootstrapped demos |
| Product | BootstrapFewShot | 100% | 50% | Only 2 training examples |
| Trip | BootstrapFewShot | 100% | 100% | 2 bootstrapped demos |
- Optimization time: ~13 minutes for all 4 agents
- LLM calls: ~3,000 for the router alone (10 trials × 12 eval examples × multiple candidates)
- Cost: Approximately $5-10 in Bedrock API costs
What MIPROv2 Generated
For the router, it proposed 3 instruction candidates. The most detailed one:
“Analyze the user’s query and identify its core intent: Is it seeking factual knowledge or information (-> research), looking to purchase or find a specific product with potential constraints like price or features (-> product), or planning travel, a trip itinerary, or destination experiences (-> trip)? Pay close attention to keywords, numerical parameters (prices, durations, dates), and the underlying goal of the request.”
For the research agent, it bootstrapped 3 demos including detailed chain-of-thought reasoning:
{
  "query": "How does photosynthesis work?",
  "reasoning": "Photosynthesis is the process by which plants convert
    light energy into chemical energy stored as glucose. It occurs
    primarily in the chloroplasts... The process has two main stages:
    the light-dependent reactions and the Calvin cycle...",
  "answer": "Photosynthesis converts light energy into chemical energy
    (glucose) in two stages..."
}
These bootstrapped demos teach the production model how to reason, not just what to answer.
Cost Optimization
The most exciting finding: 3 out of 4 agents worked perfectly on Haiku 4.5, which is roughly 10x cheaper than Sonnet 4.6. The optimized prompts (with detailed instructions and few-shot demos) gave the cheaper model enough guidance to match Sonnet’s quality.
The product agent scored only 50% on Haiku — but that’s because we only had 2 training examples. More data would likely fix this.
Production Test
We ran 3 queries through the full Strands multi-agent system:
- “What is quantum entanglement?” → Correctly routed to research_assistant → Detailed, accurate response with applications table
- “Recommend a good laptop for video editing” → Correctly routed to product_assistant → 5 specific models with prices across budget tiers
- “Plan a weekend trip to Napa Valley” → Correctly routed to trip_assistant → Full itinerary with times, restaurants, costs, and tips
All three queries were routed correctly and produced high-quality, detailed responses.
Scalability: An Honest Assessment
What Works
| Dimension | Assessment |
|---|---|
| Runtime latency | 2-5s per query (orchestrator + sub-agent = 2 LLM calls). Fine. |
| Runtime throughput | Bounded by Bedrock rate limits, not code. Scales horizontally. |
| Prompt quality | Excellent. DSPy consistently finds better prompts than hand-tuning. |
| 5-20 agents | Sweet spot. Manageable training data, reasonable optimization time. |
What Breaks
Optimization cost grows with agents:
| Agents | Est. LLM Calls | Est. Time | Est. Cost |
|---|---|---|---|
| 4 (our demo) | ~3,000 | 13 min | ~$5 |
| 10 | ~8,000 | 35 min | ~$15 |
| 30 | ~25,000 | 2 hrs | ~$50 |
| 100+ | ~100,000+ | 8+ hrs | ~$200+ |
No incremental updates. Add one agent? Re-optimize the router (since it now has a new routing target) plus the new agent. Change training data? Re-optimize everything.
Training data curation is manual. For 100 agents you need routing examples covering all types, plus quality examples per agent. This is a human bottleneck, not a compute one.
No runtime adaptation. Optimized prompts are frozen. If production traffic reveals new patterns, there’s no feedback loop — you manually add examples and re-run optimization.
Sub-agent instantiation cost. Our current code creates Agent() per tool call. At high QPS, this means connection churn to Bedrock. Fixable with pooling, but not implemented.
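One fix is a small pool that builds each sub-agent once and reuses it across tool calls (a sketch; factory would wrap the Agent(...) construction shown earlier, and a production version would need a lock around construction for concurrent traffic):

```python
class AgentPool:
    """Cache one pre-built agent per name so repeated tool calls reuse
    the same instance instead of re-instantiating (and re-connecting
    to Bedrock) on every query. Sketch, not production code."""

    def __init__(self, factory):
        self._factory = factory  # e.g. lambda name: Agent(system_prompt=...)
        self._agents = {}

    def get(self, name):
        if name not in self._agents:
            self._agents[name] = self._factory(name)
        return self._agents[name]
```

Inside the @tool functions, agent = pool.get("research") would then replace the per-call Agent(...) construction.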
The Scalable Version Would Need
┌───────────────────────────────────────────────────┐
│               SCALABLE ARCHITECTURE               │
│                                                   │
│ 1. Agent Registry (config files, not code)        │
│ 2. Prompt Cache in DB/S3 (versioned, A/B-able)    │
│ 3. Incremental Optimization (only changed agents) │
│ 4. Agent Pool (pre-warmed, reusable instances)    │
│ 5. Feedback Loop (prod logs → training data)      │
└───────────────────────────────────────────────────┘
When to Use This Approach
Good Fit
- You have fewer than ~20 agents with clear domains
- You can define a measurable quality metric per agent
- You can curate 10-50 training examples per agent
- You want to reduce costs by using optimized prompts on cheaper models
- You’re running on AWS Bedrock and want the Strands ecosystem
- Prompt quality matters more than iteration speed
Bad Fit
- You need dynamic agent creation at runtime (agents determined by input)
- You have 100+ agents — optimization cost and data curation won’t scale
- You need real-time adaptation — no feedback loop in this architecture
- Your agent topology is a deep chain (A→B→C→D) — DSPy optimizes modules independently, missing cross-module interactions
- You’re iterating rapidly on agent definitions (weekly changes = weekly re-optimization)
Alternatives to Consider
- LangGraph — better for dynamic agent graphs and complex orchestration
- CrewAI — better for agent-of-agents with role-based collaboration
- Anthropic Agent SDK — native support for handoffs and sub-agent spawning
- Manual prompt engineering — sometimes good enough
Conclusion
DSPy and Strands Agents serve different purposes that combine well:
- DSPy is an offline prompt R&D lab. It systematically searches through instruction × few-shot combinations using Bayesian optimization, finding prompts that humans wouldn’t think to write.
- Strands is a production agent runtime. It handles tool routing, sub-agent orchestration, and Bedrock integration.
The workflow is simple: optimize offline, deploy at runtime. The optimized prompts live in a JSON file — no framework lock-in, no magic. You could take those prompts and use them in any framework.
The most practical win? Cost reduction. Optimized prompts on Haiku 4.5 matched naive prompts on Sonnet 4.6 for 3 out of 4 agents in our test. That’s a 10x cost savings for the same quality — and all it took was 13 minutes of automated optimization.
The approach has real scalability limits beyond ~20 agents, and there’s no runtime learning loop. But for teams building focused multi-agent systems who want better prompts without the manual grind, this combination delivers.
Code and full implementation: see the README and the project files in this repository.