feat: add skill evaluation harness #69

Open: marc0olo wants to merge 2 commits into main from marc0olo/eval-harness

marc0olo commented Mar 4, 2026

Summary

  • Adds scripts/run-evals.js, a lightweight eval runner that uses the claude CLI to test skills with LLM-as-judge scoring
  • Adds skills/icp-cli/evals.json as the first skill eval definition (3 output evals + 20 trigger evals)
  • Makes the validator warn when a skill is missing evals.json
  • Updates CONTRIBUTING.md, CLAUDE.md, and README.md with eval documentation
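
For reviewers who want a picture of the validator change, here is a minimal sketch of a missing-evals.json check. It is not the validator code from this PR: the function name findSkillsMissingEvals and the assumption that each skill lives in its own directory under a skills folder are illustrative only.

```javascript
// Hypothetical sketch; the real validator in this PR may be structured differently.
const fs = require('node:fs');
const path = require('node:path');

// Returns the names of skill directories that have no evals.json file.
// Assumes one directory per skill directly under `skillsDir`.
function findSkillsMissingEvals(skillsDir) {
  return fs
    .readdirSync(skillsDir, { withFileTypes: true })
    .filter((entry) => entry.isDirectory())
    .map((entry) => entry.name)
    .filter((name) => !fs.existsSync(path.join(skillsDir, name, 'evals.json')))
    .sort();
}

// Emits one warning per skill and returns how many warnings were printed.
// A missing evals.json is a warning, not an error, so nothing is thrown.
function warnMissingEvals(skillsDir, log = console.warn) {
  const missing = findSkillsMissingEvals(skillsDir);
  for (const name of missing) {
    log(`warning: skill "${name}" has no evals.json`);
  }
  return missing.length;
}
```

Keeping the check as a warning matches the intent stated under Context: evals are encouraged during authoring but do not gate CI.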

How it works

For each eval case, the runner:

  1. Sends a realistic developer prompt to claude -p with the skill as the system prompt
  2. Sends the same prompt without the skill (baseline)
  3. Asks Claude (as judge) to score each expected behavior as pass/fail
  4. Prints a summary and saves full results to skills/<name>/eval-results/ (gitignored)

Usage:

node scripts/run-evals.js icp-cli                    # All evals
node scripts/run-evals.js icp-cli --eval "Deploy"    # Single eval
node scripts/run-evals.js icp-cli --no-baseline      # Skip baseline
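
The four steps above can be sketched roughly as follows. This is an illustration, not the contents of scripts/run-evals.js: the eval case fields (name, prompt, expected) and the injected runAgent and judge callbacks are assumptions made so that the flow can be read and exercised without calling the claude CLI.

```javascript
// Hypothetical sketch of the per-eval flow; names and shapes are assumptions.
// runAgent(prompt, systemPrompt) -> string   (in the real runner: a claude -p call)
// judge(output, behavior)        -> boolean  (in the real runner: Claude as judge)
function runEval(evalCase, skillText, { runAgent, judge, baseline = true }) {
  // Step 1 runs with the skill; step 2 (optional) runs the same prompt without it.
  const variants = [{ label: 'with', systemPrompt: skillText }];
  if (baseline) variants.push({ label: 'without', systemPrompt: null });

  const result = { name: evalCase.name };
  for (const { label, systemPrompt } of variants) {
    const output = runAgent(evalCase.prompt, systemPrompt);
    // Step 3: each expected behavior is scored independently as pass/fail.
    let passed = 0;
    for (const behavior of evalCase.expected) {
      if (judge(output, behavior)) passed += 1;
    }
    result[label] = { passed, total: evalCase.expected.length };
  }
  return result;
}

// Step 4 (summary only): render a score in the "7/7" form used in the tables.
function formatScore({ passed, total }) {
  return `${passed}/${total}`;
}
```

Passing baseline: false corresponds to the --no-baseline flag: only the "with" entry is produced.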

Initial icp-cli results

| Eval | WITH skill | WITHOUT skill |
| --- | --- | --- |
| New project setup | 7/7 | 0/7 |
| Deploy to mainnet | 4/4 | 1/4 |
| Migrate from dfx | 6/6 | 0/6 |

Without the skill, Claude defaults to dfx commands, dfx.json config, and --network ic flags. With the skill, all outputs use correct icp CLI syntax.

Context

Follows from the evaluation discussion in #52. Approach aligns with Anthropic's best practices: define eval cases per skill, run them during authoring, don't gate CI on LLM-based checks. No external API keys needed — uses the local claude CLI.

Add a lightweight eval framework that tests skill effectiveness by
comparing agent output with and without the skill loaded. Uses the
`claude` CLI for both agent runs and judging — no external API keys
or infrastructure needed.

- scripts/run-evals.js: eval runner (with/without skill + judge)
- skills/icp-cli/evals.json: 3 output evals + 20 trigger evals
- Validator now warns if a skill is missing evals.json
- Updated CONTRIBUTING.md, CLAUDE.md, README.md with eval guidance

marc0olo commented Mar 4, 2026

Eval results: icp-cli skill

Run: node scripts/run-evals.js icp-cli (Sonnet for agent, default model for judge)

Scores

| Eval | WITH skill | WITHOUT skill |
| --- | --- | --- |
| New project setup | 7/7 | 0/7 ❌ |
| Deploy to mainnet | 4/4 | 1/4 ❌ |
| Migrate from dfx | 6/6 | 0/6 ❌ |

Key findings

Without the skill, Claude consistently:

  • Uses dfx commands and dfx.json config (it doesn't know about the icp CLI)
  • Uses --network ic for mainnet deployment instead of -e ic
  • Generates JSON keyed-map canister config instead of YAML array syntax
  • Doesn't mention recipes, version pinning, or .icp/data/ commit guidance

With the skill, all outputs correctly use icp commands, icp.yaml, recipes with version pins, -e ic for mainnet, and proper identity migration paths.

Prompts used (realistic developer phrasing)

  1. "I want to build a dapp on ICP with a Rust backend and a React frontend. How do I set this up?"
  2. "My canisters work locally, how do I get them on mainnet?"
  3. "I have an older IC project that still uses dfx and dfx.json. It has a Motoko backend and a frontend. I want to switch to the new CLI. I also have canisters running on mainnet already."

Presents all skill descriptions as a catalog to a judge, then checks
whether each query correctly selects (or avoids) the target skill.
Batches all queries into a single judge call for efficiency.

Usage: node scripts/run-evals.js icp-cli --triggers-only
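
A rough sketch of the batched trigger check described in this commit, under stated assumptions: the query fields (query, shouldTrigger), the catalog entry fields (name, description), the prompt wording, and the JSON-array reply format are all hypothetical and may not match what scripts/run-evals.js actually does. The judge is injected as a plain function so the logic can be tested without the claude CLI.

```javascript
// Hypothetical sketch of batched trigger evals; not the code from this PR.

// Builds one prompt that lists every skill description and every query.
function buildTriggerPrompt(catalog, queries) {
  const skills = catalog.map((s) => `- ${s.name}: ${s.description}`).join('\n');
  const numbered = queries.map((q, i) => `${i + 1}. ${q.query}`).join('\n');
  return [
    'Available skills:',
    skills,
    '',
    'For each query, name the one skill that should handle it, or "none".',
    'Reply with a JSON array of strings, one entry per query, in order.',
    '',
    numbered,
  ].join('\n');
}

// A query passes when "target was selected" matches "target should trigger".
function scoreTriggers(target, queries, selections) {
  return queries.map((q, i) => {
    const selected = selections[i];
    return { query: q.query, selected, pass: (selected === target) === q.shouldTrigger };
  });
}

// All queries go to the judge in a single call, then each one is scored.
function runTriggerEvals(target, catalog, queries, judge) {
  const reply = judge(buildTriggerPrompt(catalog, queries));
  return scoreTriggers(target, queries, JSON.parse(reply));
}
```

Batching trades per-query isolation for speed: one judge call covers the whole query set, but a malformed reply fails the whole batch rather than a single query.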

marc0olo commented Mar 4, 2026

Trigger eval results: icp-cli skill

Run: node scripts/run-evals.js icp-cli --triggers-only

Scores

  • Should trigger: 10/10 ✅
  • Should NOT trigger: 10/10 ✅

Details

Should trigger — all correctly routed to icp-cli:

| Query | Selected |
| --- | --- |
| Set up a new Internet Computer project with Rust | icp-cli ✅ |
| How do I deploy my canister to the local network? | icp-cli ✅ |
| What's the icp.yaml config for a Motoko canister? | icp-cli ✅ |
| I'm getting an error with dfx deploy, can you help? | icp-cli ✅ |
| How do I start the local replica? | icp-cli ✅ |
| Migrate my dfx.json project to the new CLI | icp-cli ✅ |
| How do I create a new identity for mainnet deployment? | icp-cli ✅ |
| What recipes are available for icp-cli? | icp-cli ✅ |
| My icp deploy is failing with a build error | icp-cli ✅ |
| How do I check my canister status on mainnet? | icp-cli ✅ |

Should NOT trigger — all correctly routed elsewhere:

| Query | Selected |
| --- | --- |
| Add access control to my Motoko canister | none ✅ |
| How does stable memory work in Rust canisters? | stable-memory ✅ |
| Implement ICRC-1 token transfer in my canister | icrc-ledger ✅ |
| Write a unit test for my Motoko actor | none ✅ |
| Set up inter-canister calls between two canisters | multi-canister ✅ |
| How do I use certified variables? | certified-variables ✅ |
| Explain the IC consensus mechanism | none ✅ |
| Add Internet Identity login to my frontend | internet-identity ✅ |
| How do I handle canister upgrades safely? | stable-memory ✅ |
| What's the best way to store large data on-chain? | stable-memory ✅ |

The description's "Do NOT use for..." clause is working — no overtriggering on adjacent topics.
