Friday, 20th February 2026
Prompt vs Skill vs Tool
1) Prompt (Runtime System Prompt)
What it is: Instructions passed in the API call for a single request.
- One-time instruction (per request)
- Not enforced; the LLM may skip or reorder steps
- Good for quick control (tone, format, role)
Use when: prototyping, low-risk tasks, or temporary behavior changes.
Avoid when: you need guaranteed step execution or strict sequencing at scale.
2) Skill (Reusable Structured Prompt Module)
What it is: A reusable, structured reasoning template/module that improves consistency across repeated tasks.
- Reusable and standardized
- More consistent than ad-hoc prompts
- Still LLM-driven (probabilistic), not a hard execution engine
Use when: the task repeats often and you want consistent analysis structure, formatting, or output schema.
Avoid when: the workflow must never skip steps or must follow an exact sequence every time.
3) Tool (Deterministic Capability)
What it is: An executable function that performs a real action (API call, database query, file write, etc.).
- Deterministic execution (given correct code and inputs)
- Interacts with real systems or data
- Auditable and testable
Use when: you need real data, guaranteed operations, and repeatable correctness.
Important: Orchestrator (Code) for Strict Multi-Step Workflows
If your process requires a fixed sequence of steps that must always execute in order, the most reliable design is:
- Use code (an orchestrator or state machine) to enforce the required steps deterministically.
- Then pass the final combined results to the LLM for reasoning, optionally using a Skill for consistent formatting.
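The orchestrator pattern above can be sketched in a few lines. This is a minimal illustration, not a production design: the step functions and `summarize_with_llm` are hypothetical stand-ins for real database calls, API calls, and an LLM request.

```python
# Minimal sketch of a deterministic orchestrator: code enforces the step
# order, and only the combined results reach the LLM at the end.
# All three helper functions are hypothetical placeholders.

def fetch_order(order_id: str) -> dict:
    return {"order_id": order_id, "status": "shipped"}  # stand-in for a DB query

def fetch_shipping(order: dict) -> dict:
    return {"carrier": "DHL", "eta_days": 2}  # stand-in for an API call

def summarize_with_llm(context: dict) -> str:
    # stand-in for an LLM call; real code would pass context into a prompt
    return f"Order {context['order']['order_id']} is {context['order']['status']}."

def run_workflow(order_id: str) -> str:
    # Steps always run, and always in this order -- the LLM cannot skip them.
    order = fetch_order(order_id)
    shipping = fetch_shipping(order)
    return summarize_with_llm({"order": order, "shipping": shipping})
```

Because sequencing lives in code, a skipped or reordered step is impossible by construction; only the final reasoning step remains probabilistic.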
Quick Decision Rule
- Need guaranteed execution? Use Tools + Orchestrator.
- Need consistent repeated reasoning/output? Use a Skill.
- Need a one-off behavior tweak? Use a Prompt.
Eval methods for Tools, Skills, and Prompts, and how to ensure correctness
1. Evaluating Tools (MCP / Agentic Tools)
Tools have the most structured evaluation surface because they expose defined inputs and outputs.
Metrics to Measure
| Metric | What It Checks |
|---|---|
| Tool Correctness | Did the agent pick the right tool? |
| Argument Correctness | Were the arguments passed correctly? |
| Ordering | Were tools called in the right sequence? |
| Task Completion | Did the end-to-end trajectory achieve the goal? |
Methods
a) Deterministic comparison (fastest, most reliable)
Compare tools_called vs expected_tools — name match, argument match, and order match.
from deepeval.test_case import LLMTestCase, ToolCall
from deepeval.metrics import ToolCorrectnessMetric

test_case = LLMTestCase(
    input="What's the return policy?",
    actual_output="We offer a 30-day refund.",
    tools_called=[ToolCall(name="WebSearch"), ToolCall(name="ToolQuery")],
    expected_tools=[ToolCall(name="WebSearch")],
)
metric = ToolCorrectnessMetric(threshold=0.7)
metric.measure(test_case)
# Score = Correctly Used Tools / Total Tools Called
b) Trajectory-level evaluation
Do not just check the final output. Evaluate the full sequence of tool calls to detect missing tools, extra tools, and parameter mismatches.
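A trajectory check can be done deterministically without any judge model. The sketch below compares tool calls represented as plain dicts (name plus arguments); the exact schema is an assumption for illustration.

```python
# Sketch of a deterministic trajectory check: detect missing tools, extra
# tools, and order mismatches across the full sequence of calls.

def check_trajectory(called: list[dict], expected: list[dict]) -> dict:
    missing = [t for t in expected if t not in called]
    extra = [t for t in called if t not in expected]
    # Order is correct if the shared calls appear in the same relative order.
    order_ok = ([t for t in called if t in expected]
                == [t for t in expected if t in called])
    return {"missing": missing, "extra": extra, "order_ok": order_ok,
            "pass": not missing and not extra and order_ok}

called = [{"name": "WebSearch", "args": {"q": "return policy"}},
          {"name": "ToolQuery", "args": {}}]
expected = [{"name": "WebSearch", "args": {"q": "return policy"}}]
result = check_trajectory(called, expected)  # fails: extra ToolQuery call
```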
c) LLM-as-judge fallback
When tool usage is correct but non-obvious, use a judge model to assess whether the chosen tools were optimal given the full set of tools available to the agent.
2. Evaluating Skills
Skills require evaluation of both activation (did the correct skill load?) and output quality (did it improve results?).
a) Skill Activation / Routing Evals
| Prompt | Expected Skill | should_trigger |
|---|---|---|
| Review this PR for security issues | security-review | true |
| Fix the typo on line 3 | security-review | false |
| Check this code for vulnerabilities | security-review | true |
Grading is deterministic pass or fail — did the expected skill activate?
b) Skill Output Quality Evals
- LLM-as-judge with rubric scoring (1 to 5 scale)
- Exact or string match for structured sections
- A/B comparison (with skill vs without skill)
c) Progressive Disclosure Check
Measure token usage when multiple skills are available to ensure context does not grow unnecessarily.
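One way to make this measurable is to track how the context's token count grows as each skill is loaded. The sketch below uses a crude 4-characters-per-token heuristic and an assumed `build_context` shape; a real check would use the model's actual tokenizer (e.g. tiktoken).

```python
# Sketch of a progressive-disclosure check: rough token counts for the
# context as each additional skill body is loaded. The 4-chars-per-token
# heuristic is an assumption, not a real tokenizer.

def approx_tokens(text: str) -> int:
    return len(text) // 4  # crude heuristic; swap in a real tokenizer

def build_context(skill_bodies: list[str]) -> str:
    return "\n\n".join(skill_bodies)

def context_growth(skill_bodies: list[str]) -> list[int]:
    # Token count after loading skills[0], then skills[0..1], and so on.
    return [approx_tokens(build_context(skill_bodies[: i + 1]))
            for i in range(len(skill_bodies))]
```

A budget assertion (e.g. the final count must stay under a fixed ceiling) turns this into a regression test that fails when a new skill bloats the context.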
3. Evaluating Prompts
a) Code-based grading (preferred)
# Exact match
def eval_exact(output, expected):
    return output.strip().lower() == expected.strip().lower()

# String containment
def eval_contains(output, key_phrase):
    return key_phrase in output
b) LLM-as-judge (nuanced assessment)
def evaluate_likert(model_output, rubric):
    prompt = f"""Rate this response on a scale of 1-5:
<rubric>{rubric}</rubric>
<response>{model_output}</response>
Think step-by-step, then output only the number."""
    return call_judge_model(prompt)
c) Embedding similarity
Use cosine similarity to ensure paraphrased inputs produce semantically consistent outputs.
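A minimal sketch of the cosine-similarity check, assuming some `embed` function. Here `embed` is a toy bag-of-words vectorizer so the example is self-contained; in practice it would be a real embedding model or API call.

```python
import math

# Sketch of an embedding-similarity eval. embed() is a stand-in for a real
# embedding model; the toy bag-of-words version keeps this self-contained.

def embed(text: str) -> dict[str, float]:
    vec: dict[str, float] = {}
    for word in text.lower().split():
        vec[word] = vec.get(word, 0.0) + 1.0
    return vec

def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    dot = sum(a[k] * b.get(k, 0.0) for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def semantically_consistent(out_a: str, out_b: str, threshold: float = 0.8) -> bool:
    return cosine(embed(out_a), embed(out_b)) >= threshold
```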
d) ROUGE-L for summarization
Measures overlap between generated and reference summaries.
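ROUGE-L is the F-measure over the longest common subsequence (LCS) of the generated and reference token sequences. A minimal sketch, using whitespace tokenization as a simplifying assumption (real implementations tokenize more carefully):

```python
# Minimal ROUGE-L sketch: F-measure over the longest common subsequence
# of generated vs reference tokens.

def lcs_len(a: list[str], b: list[str]) -> int:
    # Classic dynamic-programming LCS length.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if x == y
                                else max(dp[i][j + 1], dp[i + 1][j]))
    return dp[-1][-1]

def rouge_l(generated: str, reference: str) -> float:
    g, r = generated.lower().split(), reference.lower().split()
    lcs = lcs_len(g, r)
    if lcs == 0:
        return 0.0
    prec, rec = lcs / len(g), lcs / len(r)
    return 2 * prec * rec / (prec + rec)
```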
Universal Best Practices
- Volume over perfection — automate at scale.
- Include edge cases such as typos, ambiguity, long inputs, and topic shifts.
- Use a different model as judge than the one being evaluated.
- Ask the judge to reason before scoring.
- Automate and version evaluations like tests.
- Combine deterministic checks with LLM-based scoring.
Quick Reference
- TOOLS — deterministic tool and argument match plus trajectory validation
- SKILLS — activation tests plus rubric-based output quality
- PROMPTS — exact match where possible plus LLM-judge for qualitative tasks
Understanding LLM-Driven Python Execution: Architecture, Terminology, and Use Cases
This pattern is not just “tool use.” It is a Reasoning → Execution → Observation loop where the LLM can generate Python during runtime and run it inside a sandbox, producing deterministic outputs.
[... 646 words]

NVIDIA’s GR00T Whole-Body Control stack in MuJoCo
I’ve been running NVIDIA’s GR00T Whole-Body Control stack in MuJoCo — the sim-to-real bridge for humanoid robot locomotion. A MuJoCo viewer showing a simulated robot walking might look like a toy, but the neural network policy inside it is the same binary that runs on a real Unitree G1. Here’s what’s actually going on.
[... 759 words]