Add a lightweight eval framework that tests skill effectiveness by comparing agent output with and without the skill loaded. Uses the `claude` CLI for both agent runs and judging, so no external API keys or infrastructure are needed.
- `scripts/run-evals.js`: eval runner (with/without skill + judge)
- `skills/icp-cli/evals.json`: 3 output evals + 20 trigger evals
- Validator now warns if a skill is missing `evals.json`
- Updated CONTRIBUTING.md, CLAUDE.md, README.md with eval guidance
Eval results:
| Eval | WITH skill | WITHOUT skill |
|---|---|---|
| New project setup | 7/7 ✅ | 0/7 ❌ |
| Deploy to mainnet | 4/4 ✅ | 1/4 ❌ |
| Migrate from dfx | 6/6 ✅ | 0/6 ❌ |
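The `n/m` cells in the table read as expectations met out of expectations checked. A minimal sketch of that tally, assuming the judge returns one boolean verdict per expectation and that a row earns ✅ only when every expectation passes. Both are assumptions, since the PR does not state the pass threshold, and `summarize` is a hypothetical helper rather than a function from `scripts/run-evals.js`:

```javascript
// Hypothetical helper: turn per-expectation judge verdicts into a table cell.
// Assumption: a case passes only when every expectation is met.
function summarize(verdicts) {
  const met = verdicts.filter(Boolean).length;
  const total = verdicts.length;
  const passed = total > 0 && met === total;
  return { met, total, passed, cell: `${met}/${total} ${passed ? "✅" : "❌"}` };
}

// Illustrative verdicts only, not taken from a real run.
console.log(summarize([true, true, true, true]).cell);
console.log(summarize([true, false, false, false]).cell);
```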
Key findings
Without the skill, Claude consistently:
- Uses `dfx` commands and `dfx.json` config (doesn't know about the `icp` CLI)
- Uses `--network ic` for mainnet deployment instead of `-e ic`
- Generates JSON keyed-map canister config instead of YAML array syntax
- Doesn't mention recipes, version pinning, or `.icp/data/` commit guidance
With the skill, all outputs correctly use `icp` commands, `icp.yaml`, recipes with version pins, `-e ic` for mainnet, and proper identity migration paths.
Prompts used (realistic developer phrasing)
- "I want to build a dapp on ICP with a Rust backend and a React frontend. How do I set this up?"
- "My canisters work locally, how do I get them on mainnet?"
- "I have an older IC project that still uses dfx and dfx.json. It has a Motoko backend and a frontend. I want to switch to the new CLI. I also have canisters running on mainnet already."
Presents all skill descriptions as a catalog to a judge, then checks whether each query correctly selects (or avoids) the target skill. Batches all queries into a single judge call for efficiency.
Usage: `node scripts/run-evals.js icp-cli --triggers-only`
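The batching described here could look roughly like the sketch below. It is not the code in `scripts/run-evals.js`: the data shapes (`{ name, description }` for catalog entries, `{ query, shouldTrigger }` for trigger cases), the prompt wording, and both function names are assumptions made for illustration.

```javascript
// Hypothetical shapes: catalog entries are { name, description };
// trigger cases are { query, shouldTrigger }.
function buildTriggerPrompt(catalog, cases) {
  const skills = catalog.map((s) => `- ${s.name}: ${s.description}`).join("\n");
  const queries = cases.map((c, i) => `${i + 1}. ${c.query}`).join("\n");
  // One prompt carries every query, so the judge is called once per skill.
  return [
    "Available skills:",
    skills,
    "",
    'For each query, answer with the single best skill name, or "none".',
    "Reply as a JSON array of strings, one entry per query, in order.",
    "",
    queries,
  ].join("\n");
}

// Compare the judge's selections against the target skill.
// A case passes when "target was selected" matches "target should trigger".
function scoreTriggers(targetSkill, cases, selections) {
  return cases.map((c, i) => {
    const selected = selections[i];
    const triggered = selected === targetSkill;
    return { query: c.query, selected, pass: triggered === c.shouldTrigger };
  });
}
```

Under these assumptions, a negative case passes whether the judge picks another skill or `none`, which matches how the "Should NOT trigger" rows are marked.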
Trigger eval results:
| Query | Selected |
|---|---|
| Set up a new Internet Computer project with Rust | icp-cli ✅ |
| How do I deploy my canister to the local network? | icp-cli ✅ |
| What's the icp.yaml config for a Motoko canister? | icp-cli ✅ |
| I'm getting an error with dfx deploy, can you help? | icp-cli ✅ |
| How do I start the local replica? | icp-cli ✅ |
| Migrate my dfx.json project to the new CLI | icp-cli ✅ |
| How do I create a new identity for mainnet deployment? | icp-cli ✅ |
| What recipes are available for icp-cli? | icp-cli ✅ |
| My icp deploy is failing with a build error | icp-cli ✅ |
| How do I check my canister status on mainnet? | icp-cli ✅ |
Should NOT trigger — all correctly routed elsewhere:
| Query | Selected |
|---|---|
| Add access control to my Motoko canister | none ✅ |
| How does stable memory work in Rust canisters? | stable-memory ✅ |
| Implement ICRC-1 token transfer in my canister | icrc-ledger ✅ |
| Write a unit test for my Motoko actor | none ✅ |
| Set up inter-canister calls between two canisters | multi-canister ✅ |
| How do I use certified variables? | certified-variables ✅ |
| Explain the IC consensus mechanism | none ✅ |
| Add Internet Identity login to my frontend | internet-identity ✅ |
| How do I handle canister upgrades safely? | stable-memory ✅ |
| What's the best way to store large data on-chain? | stable-memory ✅ |
The description's "Do NOT use for..." clause is working — no overtriggering on adjacent topics.
Summary
- `scripts/run-evals.js`: a lightweight eval runner that uses the `claude` CLI to test skills with LLM-as-judge scoring
- `skills/icp-cli/evals.json` as the first skill eval definition (3 output evals + 20 trigger evals)
- Validator warning when a skill is missing `evals.json`
How it works
For each eval case, the runner:
- Runs `claude -p` with the skill as system prompt
- Runs the same prompt without the skill
- Has a judge score both outputs
- Saves results to `skills/<name>/eval-results/` (gitignored)
Initial icp-cli results
Without the skill, Claude defaults to `dfx` commands, `dfx.json` config, and `--network ic` flags. With the skill, all outputs use correct `icp` CLI syntax.
Context
Follows from the evaluation discussion in #52. The approach aligns with Anthropic's best practices: define eval cases per skill, run them during authoring, and don't gate CI on LLM-based checks. No external API keys are needed; the runner uses the local `claude` CLI.
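The with/without comparison under "How it works" can be sketched as follows. This is a hedged illustration, not the contents of `scripts/run-evals.js`: the function names and the eval case shape (`{ name, prompt }`) are hypothetical. Passing the skill text via `--append-system-prompt` is also an assumption, so check `claude --help` for the flag your CLI version supports.

```javascript
const { execFileSync } = require("node:child_process");

// Build argv for one non-interactive `claude -p` run.
// Assumption: the skill text is supplied with --append-system-prompt.
function buildClaudeArgs(prompt, skillText) {
  const args = ["-p", prompt];
  if (skillText) args.push("--append-system-prompt", skillText);
  return args;
}

// Default executor shells out to the local CLI; callers can inject a stub.
function runClaude(args) {
  return execFileSync("claude", args, {
    encoding: "utf8",
    maxBuffer: 16 * 1024 * 1024,
  });
}

// Run one eval case twice and return both outputs for a later judge step.
function runCase(evalCase, skillText, exec = runClaude) {
  return {
    name: evalCase.name,
    withSkill: exec(buildClaudeArgs(evalCase.prompt, skillText)),
    withoutSkill: exec(buildClaudeArgs(evalCase.prompt, null)),
  };
}
```

Because the executor is injectable, the comparison logic can be exercised with a stub, for example `runCase(c, skill, (args) => args.join(" "))`, without invoking the CLI.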