Eval methods for Tools, Skills, and Prompts, and how to ensure correctness
20th February 2026
## 1. Evaluating Tools (MCP / Agentic Tools)
Tools have the most structured evaluation surface because they expose defined inputs and outputs.
### Metrics to Measure
| Metric | What It Checks |
|---|---|
| Tool Correctness | Did the agent pick the right tool? |
| Argument Correctness | Were the arguments passed correctly? |
| Ordering | Were tools called in the right sequence? |
| Task Completion | Did the end-to-end trajectory achieve the goal? |
### Methods

#### a) Deterministic comparison (fastest, most reliable)

Compare `tools_called` vs `expected_tools`: name match, argument match, and order match.
```python
from deepeval.test_case import LLMTestCase, ToolCall
from deepeval.metrics import ToolCorrectnessMetric

test_case = LLMTestCase(
    input="What's the return policy?",
    actual_output="We offer a 30-day refund.",
    tools_called=[ToolCall(name="WebSearch"), ToolCall(name="ToolQuery")],
    expected_tools=[ToolCall(name="WebSearch")],
)

metric = ToolCorrectnessMetric(threshold=0.7)
metric.measure(test_case)  # Score = correctly used tools / total tools called
```
#### b) Trajectory-level evaluation
Do not just check the final output. Evaluate the full sequence of tool calls to detect missing tools, extra tools, and parameter mismatches.
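A trajectory check of that kind can be sketched as follows; the function name and the dict shape of a tool call are illustrative assumptions, not any particular framework's API:

```python
def evaluate_trajectory(actual_calls, expected_calls):
    """Compare a full tool-call trajectory against an expected one.

    Reports missing tools, extra tools, and whether the expected
    calls appear in the right order.
    """
    actual_names = [c["name"] for c in actual_calls]
    expected_names = [c["name"] for c in expected_calls]
    missing = [n for n in expected_names if n not in actual_names]
    extra = [n for n in actual_names if n not in expected_names]
    # Order check: the expected calls must appear as a subsequence
    # of the actual calls ("in" on an iterator consumes it).
    it = iter(actual_names)
    in_order = all(name in it for name in expected_names)
    return {"missing": missing, "extra": extra, "in_order": in_order}
```

A trajectory with an extra call still reports `in_order: True`; whether extras should fail the eval is a policy decision for your test suite.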
#### c) LLM-as-judge fallback
When tool usage is correct but non-obvious, use a judge model to assess whether the chosen tools were optimal given the available tools context.
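One way to frame that judge call is to put the full tool catalogue in the prompt so the judge can weigh alternatives. A hypothetical prompt builder (everything here is an assumption, including the YES/NO output format):

```python
def build_tool_judge_prompt(task, tools_available, tools_called):
    """Build a prompt asking a separate judge model whether the
    chosen tools were optimal given the full tool catalogue."""
    catalogue = "\n".join(f"- {t}" for t in tools_available)
    called = ", ".join(tools_called) or "(none)"
    return (
        f"Task: {task}\n"
        f"Available tools:\n{catalogue}\n"
        f"Tools actually called: {called}\n"
        "Were the chosen tools optimal for this task? "
        "Reason step-by-step, then answer YES or NO on the final line."
    )
```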
## 2. Evaluating Skills
Skills require evaluation of both activation (did the correct skill load?) and output quality (did it improve results?).
### a) Skill Activation / Routing Evals
| Prompt | Expected Skill | should_trigger |
|---|---|---|
| Review this PR for security issues | security-review | true |
| Fix the typo on line 3 | security-review | false |
| Check this code for vulnerabilities | security-review | true |
Grading is deterministic pass or fail — did the expected skill activate?
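The table above translates directly into a small harness; `resolve_skill` stands in for whatever routing function your agent framework exposes (an assumption, not a real API):

```python
# Routing cases taken from the table above: (prompt, skill, should_trigger).
ROUTING_CASES = [
    ("Review this PR for security issues", "security-review", True),
    ("Fix the typo on line 3", "security-review", False),
    ("Check this code for vulnerabilities", "security-review", True),
]

def run_activation_evals(resolve_skill):
    """Return the prompts whose skill activation did not match
    expectations; an empty list means every routing case passed."""
    failures = []
    for prompt, skill, should_trigger in ROUTING_CASES:
        triggered = skill in resolve_skill(prompt)
        if triggered != should_trigger:
            failures.append(prompt)
    return failures
```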
### b) Skill Output Quality Evals
- LLM-as-judge with rubric scoring (1 to 5 scale)
- Exact or string match for structured sections
- A/B comparison (with skill vs without skill)
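The A/B comparison can be run as a pairwise judge prompt; this sketch fixes which response is A for clarity, though a production harness should randomise the order to avoid position bias (the tag names and TIE option are assumptions):

```python
def build_ab_prompt(task, output_with_skill, output_without_skill):
    """Pairwise judge prompt comparing the with-skill and
    without-skill outputs for the same task."""
    return (
        f"Task: {task}\n\n"
        f"<response_a>{output_with_skill}</response_a>\n"
        f"<response_b>{output_without_skill}</response_b>\n\n"
        "Which response better completes the task? "
        "Reason step-by-step, then answer A, B, or TIE on the final line."
    )
```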
### c) Progressive Disclosure Check
Measure token usage when multiple skills are available to ensure context does not grow unnecessarily.
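A minimal sketch of such a check, assuming only skill summaries sit in context up front while full skill bodies load on demand. The 4-characters-per-token heuristic and the budget value are illustrative assumptions; use your model's real tokenizer (e.g. tiktoken) for accurate counts:

```python
def estimate_tokens(text):
    # Crude ~4-chars-per-token heuristic, good enough for a budget alarm.
    return len(text) // 4

def check_progressive_disclosure(base_context, skill_summaries, budget=500):
    """Flag when the always-loaded skill overhead exceeds a token budget."""
    overhead = sum(estimate_tokens(s) for s in skill_summaries)
    return {
        "base_tokens": estimate_tokens(base_context),
        "skills_overhead_tokens": overhead,
        "within_budget": overhead <= budget,
    }
```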
## 3. Evaluating Prompts

### a) Code-based grading (preferred)
```python
# Exact match (strip and lowercase both sides so whitespace and
# casing differences do not cause spurious failures)
def eval_exact(output, expected):
    return output.strip().lower() == expected.strip().lower()

# String containment
def eval_contains(output, key_phrase):
    return key_phrase in output
```
### b) LLM-as-judge (nuanced assessment)
```python
def evaluate_likert(model_output, rubric):
    prompt = f"""Rate this response on a scale of 1-5:
<rubric>{rubric}</rubric>
<response>{model_output}</response>
Think step-by-step, then output only the number."""
    return call_judge_model(prompt)  # call_judge_model wraps your judge LLM
```
### c) Embedding similarity
Use cosine similarity to ensure paraphrased inputs produce semantically consistent outputs.
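Given two embedding vectors (from whatever embedding model you use), the check reduces to a cosine similarity threshold; the 0.85 cutoff below is an illustrative assumption to tune per model:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def outputs_consistent(emb1, emb2, threshold=0.85):
    """True when two output embeddings are semantically close."""
    return cosine_similarity(emb1, emb2) >= threshold
```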
### d) ROUGE-L for summarization
Measures overlap between generated and reference summaries.
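ROUGE-L scores that overlap via the longest common subsequence (LCS) of the two token streams. A self-contained sketch over plain whitespace tokens; real ROUGE implementations add stemming and smarter tokenisation:

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[len(a)][len(b)]

def rouge_l_f1(generated, reference):
    """ROUGE-L F1: harmonic mean of LCS precision and recall."""
    gen, ref = generated.split(), reference.split()
    lcs = lcs_length(gen, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(gen), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```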
## Universal Best Practices
- Volume over perfection — automate at scale.
- Include edge cases such as typos, ambiguity, long inputs, and topic shifts.
- Use a different model as judge than the one being evaluated.
- Ask the judge to reason before scoring.
- Automate and version evaluations like tests.
- Combine deterministic checks with LLM-based scoring.
## Quick Reference
- TOOLS — deterministic tool and argument match plus trajectory validation
- SKILLS — activation tests plus rubric-based output quality
- PROMPTS — exact match where possible plus LLM-judge for qualitative tasks