feat: add skill evaluation harness by marc0olo · Pull Request #69 · dfinity/icskills

marc0olo · 2026-03-04T17:41:37Z

Summary

Adds scripts/run-evals.js, a lightweight eval runner that uses the claude CLI to test skills with LLM-as-judge scoring
Adds skills/icp-cli/evals.json as the first skill eval definition (3 output evals + 20 trigger evals)
Validator now warns when a skill is missing evals.json
Updated CONTRIBUTING.md, CLAUDE.md, README.md with eval documentation
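
The validator warning for a missing evals.json could look roughly like the sketch below. This is an illustration, not the code in this PR: the helper name `checkEvalsFile`, the warning text, and the extra parse check are assumptions. Only the `skills/<name>/evals.json` location is taken from the description above.

```javascript
const fs = require("node:fs");
const path = require("node:path");

// Hypothetical helper: returns warnings (never hard errors) for one skill,
// so a missing evals.json does not fail validation.
function checkEvalsFile(skillsRoot, skillName) {
  const evalsPath = path.join(skillsRoot, skillName, "evals.json");
  if (!fs.existsSync(evalsPath)) {
    return [`WARN: skill "${skillName}" has no evals.json`];
  }
  try {
    // Assumption: an unparseable file is also worth a warning.
    JSON.parse(fs.readFileSync(evalsPath, "utf8"));
    return [];
  } catch (err) {
    return [`WARN: skill "${skillName}" has an unparseable evals.json: ${err.message}`];
  }
}
```

Returning a list of warnings rather than throwing keeps the check advisory, which matches the "warns" wording above.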

How it works

For each eval case, the runner:

Sends a realistic developer prompt to claude -p with the skill as system prompt
Sends the same prompt without the skill (baseline)
Asks Claude (as judge) to score each expected behavior as pass/fail
Prints a summary and saves full results to skills/<name>/eval-results/ (gitignored)
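
The four steps above can be sketched as a single function. This is a minimal illustration under stated assumptions, not the runner's actual code: the eval-case shape (`name`, `prompt`, `expectedBehaviors`) and the injected `runAgent` and `judge` callbacks are hypothetical names chosen so the sketch does not depend on any specific CLI invocation.

```javascript
// Hypothetical shapes:
//   evalCase = { name, prompt, expectedBehaviors: string[] }
//   runAgent(prompt, systemPrompt) -> Promise<string>   (systemPrompt is null for the baseline)
//   judge(output, behavior)        -> Promise<boolean>  (pass/fail for one expected behavior)
async function runEvalCase(evalCase, skillText, { runAgent, judge, baseline = true }) {
  const variants = [{ label: "with", systemPrompt: skillText }];
  if (baseline) variants.push({ label: "without", systemPrompt: null });

  const result = { name: evalCase.name };
  for (const { label, systemPrompt } of variants) {
    // Same prompt for both variants; only the system prompt differs.
    const output = await runAgent(evalCase.prompt, systemPrompt);
    const verdicts = [];
    for (const behavior of evalCase.expectedBehaviors) {
      verdicts.push({ behavior, pass: await judge(output, behavior) });
    }
    const passed = verdicts.filter((v) => v.pass).length;
    result[label] = { output, verdicts, score: `${passed}/${verdicts.length}` };
  }
  return result;
}
```

Injecting the agent and judge keeps the comparison logic testable with stubs; a real runner would pass functions that shell out to the claude CLI and would write `result` to the eval-results directory.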

node scripts/run-evals.js icp-cli                    # All evals
node scripts/run-evals.js icp-cli --eval "Deploy"    # Single eval
node scripts/run-evals.js icp-cli --no-baseline       # Skip baseline
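
One way to parse the flags shown above is Node's built-in `util.parseArgs` (available in Node 18.3 and later). The sketch is an assumption about how such parsing might be done, not the script's actual implementation; the function name `parseCliArgs` and the returned object shape are hypothetical.

```javascript
const { parseArgs } = require("node:util");

// Sketch only: option names mirror the flags in the usage lines;
// the return shape is invented for illustration.
function parseCliArgs(argv) {
  const { values, positionals } = parseArgs({
    args: argv,
    allowPositionals: true,
    options: {
      eval: { type: "string" },
      "no-baseline": { type: "boolean" },
    },
  });
  if (positionals.length !== 1) {
    throw new Error("usage: run-evals.js <skill> [--eval <name>] [--no-baseline]");
  }
  return {
    skill: positionals[0],
    evalFilter: values.eval ?? null, // null means "run all evals"
    baseline: !values["no-baseline"], // baseline runs unless explicitly skipped
  };
}
```

In a script this would be called as `parseCliArgs(process.argv.slice(2))`.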

Initial icp-cli results

Eval	WITH skill	WITHOUT skill
New project setup	7/7	0/7
Deploy to mainnet	4/4	1/4
Migrate from dfx	6/6	0/6

Without the skill, Claude defaults to dfx commands, dfx.json config, and --network ic flags. With the skill, all outputs use correct icp CLI syntax.

Context

Follows from the evaluation discussion in #52. Approach aligns with Anthropic's best practices: define eval cases per skill, run them during authoring, don't gate CI on LLM-based checks. No external API keys needed — uses the local claude CLI.
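
As a rough illustration of calling the local claude CLI from Node, the sketch below builds the argument list and runs it synchronously. It assumes the skill text is passed with `--append-system-prompt`; the PR may use a different flag or mechanism to supply the system prompt, and the helper names `buildClaudeArgs` and `runClaude` are hypothetical.

```javascript
const { execFileSync } = require("node:child_process");

// Hypothetical helper: `-p` is print (non-interactive) mode.
// Assumption: the skill is supplied via --append-system-prompt.
function buildClaudeArgs(prompt, skillText) {
  const args = ["-p", prompt];
  if (skillText) args.push("--append-system-prompt", skillText);
  return args;
}

// `exec` is injectable so the function can be exercised without the CLI installed.
function runClaude(prompt, skillText, exec = execFileSync) {
  return exec("claude", buildClaudeArgs(prompt, skillText), { encoding: "utf8" });
}
```

Using `execFileSync` with an argument array, rather than a shell string, avoids quoting problems when the prompt or skill text contains quotes or newlines.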

Add a lightweight eval framework that tests skill effectiveness by comparing agent output with and without the skill loaded. Uses the `claude` CLI for both agent runs and judging; no external API keys or infrastructure needed.

- scripts/run-evals.js: eval runner (with/without skill + judge)
- skills/icp-cli/evals.json: 3 output evals + 20 trigger evals
- Validator now warns if a skill is missing evals.json
- Updated CONTRIBUTING.md, CLAUDE.md, README.md with eval guidance

marc0olo · 2026-03-04T17:44:08Z

Eval results: `icp-cli` skill

Run: node scripts/run-evals.js icp-cli (Sonnet for agent, default model for judge)

Scores

Eval	WITH skill	WITHOUT skill
New project setup	7/7 ✅	0/7 ❌
Deploy to mainnet	4/4 ✅	1/4 ❌
Migrate from dfx	6/6 ✅	0/6 ❌

Key findings

Without the skill, Claude consistently:

Uses dfx commands and dfx.json config (doesn't know about icp CLI)
Uses --network ic for mainnet deployment instead of -e ic
Generates JSON keyed-map canister config instead of YAML array syntax
Doesn't mention recipes, version pinning, or .icp/data/ commit guidance

With the skill, all outputs correctly use icp commands, icp.yaml, recipes with version pins, -e ic for mainnet, and proper identity migration paths.

Prompts used (realistic developer phrasing)

"I want to build a dapp on ICP with a Rust backend and a React frontend. How do I set this up?"
"My canisters work locally, how do I get them on mainnet?"
"I have an older IC project that still uses dfx and dfx.json. It has a Motoko backend and a frontend. I want to switch to the new CLI. I also have canisters running on mainnet already."

Trigger evals: presents all skill descriptions as a catalog to a judge, then checks whether each query correctly selects (or avoids) the target skill. Batches all queries into a single judge call for efficiency.

Usage: node scripts/run-evals.js icp-cli --triggers-only
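
The catalog-and-batch approach might be sketched as follows. The prompt wording, the data shapes (`{ name, description }` for skills, `{ query, shouldTrigger }` for cases), and both function names are assumptions made for illustration; the actual evals.json schema and judge prompt are not shown in this PR description.

```javascript
// Builds one judge prompt containing the whole skill catalog and every query,
// so all trigger cases are decided in a single call.
function buildTriggerPrompt(skills, triggerCases) {
  const catalog = skills.map((s) => `- ${s.name}: ${s.description}`).join("\n");
  const queries = triggerCases.map((c, i) => `${i + 1}. ${c.query}`).join("\n");
  return [
    "Skill catalog:",
    catalog,
    "",
    'For each query, answer with the single best skill name, or "none".',
    "Reply as a JSON array of strings, one entry per query, in order.",
    "",
    queries,
  ].join("\n");
}

// selections: the judge's parsed JSON array, aligned with triggerCases.
// A "should trigger" case passes when the target skill was selected;
// a "should NOT trigger" case passes when anything else (or "none") was selected.
function scoreTriggers(targetSkill, triggerCases, selections) {
  const tally = {
    shouldTrigger: { pass: 0, total: 0 },
    shouldNotTrigger: { pass: 0, total: 0 },
  };
  triggerCases.forEach((c, i) => {
    const bucket = c.shouldTrigger ? tally.shouldTrigger : tally.shouldNotTrigger;
    const selected = selections[i] === targetSkill;
    bucket.total += 1;
    if (selected === c.shouldTrigger) bucket.pass += 1;
  });
  return tally;
}
```

Scoring negative cases as "anything but the target" means a query routed to another skill or to "none" both count as passes for the skill under test.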

marc0olo · 2026-03-04T17:54:22Z

Trigger eval results: `icp-cli` skill

Run: node scripts/run-evals.js icp-cli --triggers-only

Scores

Should trigger: 10/10 ✅
Should NOT trigger: 10/10 ✅

Details

Should trigger — all correctly routed to icp-cli:

Query	Selected
Set up a new Internet Computer project with Rust	icp-cli ✅
How do I deploy my canister to the local network?	icp-cli ✅
What's the icp.yaml config for a Motoko canister?	icp-cli ✅
I'm getting an error with dfx deploy, can you help?	icp-cli ✅
How do I start the local replica?	icp-cli ✅
Migrate my dfx.json project to the new CLI	icp-cli ✅
How do I create a new identity for mainnet deployment?	icp-cli ✅
What recipes are available for icp-cli?	icp-cli ✅
My icp deploy is failing with a build error	icp-cli ✅
How do I check my canister status on mainnet?	icp-cli ✅

Should NOT trigger — all correctly routed elsewhere:

Query	Selected
Add access control to my Motoko canister	none ✅
How does stable memory work in Rust canisters?	stable-memory ✅
Implement ICRC-1 token transfer in my canister	icrc-ledger ✅
Write a unit test for my Motoko actor	none ✅
Set up inter-canister calls between two canisters	multi-canister ✅
How do I use certified variables?	certified-variables ✅
Explain the IC consensus mechanism	none ✅
Add Internet Identity login to my frontend	internet-identity ✅
How do I handle canister upgrades safely?	stable-memory ✅
What's the best way to store large data on-chain?	stable-memory ✅

The description's "Do NOT use for..." clause is working — no overtriggering on adjacent topics.

marc0olo requested review from JoshDFN and raymondk as code owners March 4, 2026 17:41

marc0olo wants to merge 2 commits into main from marc0olo/eval-harness
