dfinity · raymondk · Mar 5, 2026 · Mar 4, 2026
@@ -39,6 +39,17 @@ npm run validate     # Fix all errors before committing. Warnings are acceptable
 ```
 Validate runs in CI and blocks deployment on errors.
 
+## Evaluations
+
+Each skill should have an evaluation file at `evaluations/<skill-name>.json`. Run evaluations with:
+```bash
+node scripts/evaluate-skills.js <skill-name>              # All evals
+node scripts/evaluate-skills.js <skill-name> --eval "X"   # Single eval by name
+node scripts/evaluate-skills.js <skill-name> --no-baseline # Skip without-skill baseline
+node scripts/evaluate-skills.js <skill-name> --triggers-only # Trigger evals only
+```
+Results are saved to `evaluations/results/` (gitignored). See `evaluations/icp-cli.json` for the format.
+
 ## Writing Guidelines
 
 - **Write for agents, not humans.** Be explicit with canister IDs, function signatures, and error messages.

@@ -9,3 +9,5 @@ public/llms.txt
 public/llms-full.txt
 .astro
 lighthouse-*
+.eval-tmp
+evaluations/results/
@@ -123,13 +123,35 @@ npm run validate     # Check frontmatter and sections
 
 This runs automatically in CI and blocks deployment on errors.
 
-### 4. That's it — the website auto-discovers skills
+### 4. Add evaluation cases
+
+Create `evaluations/<skill-name>.json` with test cases that verify the skill works. The eval file has two sections:
+
+- **`output_evals`** — realistic prompts with expected behaviors a judge can check
+- **`trigger_evals`** — queries that should/shouldn't activate the skill
+
+See `evaluations/icp-cli.json` for a working example. Write prompts the way a developer would actually ask — vague and incomplete, not over-specified test questions.
+
+**Running evaluations** (optional, requires `claude` CLI):
+
+```bash
+node scripts/evaluate-skills.js <skill-name>                    # All evals, with + without skill
+node scripts/evaluate-skills.js <skill-name> --eval "name"      # Single eval
+node scripts/evaluate-skills.js <skill-name> --no-baseline       # Skip without-skill run
+node scripts/evaluate-skills.js <skill-name> --triggers-only     # Trigger evals only
+```
+
+This sends each prompt to Claude with and without the skill, then has a judge score the output. Results are saved to `evaluations/results/` (gitignored).
+
+Including a summary of eval results in your PR description is recommended but not required — running evals needs `claude` CLI access and costs API credits.
+
+### 5. That's it — the website auto-discovers skills
 
 The website is automatically generated from the SKILL.md frontmatter at build time. You do **not** need to edit any source file. Astro reads all `skills/*/SKILL.md` files, parses their frontmatter, and generates the site pages, `llms.txt`, discovery endpoints, and other files.
 
 Stats (skill count, categories) all update automatically.
 
-### 5. Submit a PR
+### 6. Submit a PR
 
 - One skill per PR
 - Include a brief description of what the skill covers and why it's needed

@@ -86,6 +86,7 @@ See [CONTRIBUTING.md](CONTRIBUTING.md) for how to add or update skills.
 - **Hosting**: GitHub Pages via Actions
 - **Skills**: Plain markdown files in `skills/*/SKILL.md`
 - **Validation**: Structural linter for frontmatter and code blocks (`npm run validate`)
+- **Evaluation**: Per-skill eval cases with LLM-as-judge scoring (`node scripts/evaluate-skills.js <skill>`)
 - **Schema**: JSON Schema for frontmatter at `skills/skill.schema.json`
 - **SEO**: Per-skill meta tags, JSON-LD (TechArticle), sitemap, canonical URLs
 - **Skills Discovery**: `llms.txt`, `llms-full.txt`, `.well-known/skills/` ([Skills Discovery RFC](https://github.com/cloudflare/agent-skills-discovery-rfc))

@@ -0,0 +1,70 @@
+{
+  "skill": "icp-cli",
+  "description": "Evaluation cases for the icp-cli skill. Tests whether agents produce correct icp-cli commands and configuration instead of legacy dfx equivalents.",
+
+  "output_evals": [
+    {
+      "name": "New project setup",
+      "prompt": "I want to build a dapp on ICP with a Rust backend and a React frontend. How do I set this up?",
+      "expected_behaviors": [
+        "Uses icp (not dfx) commands throughout",
+        "Configuration file is icp.yaml, NOT dfx.json",
+        "Canisters are a YAML array of objects (- name: ...), NOT a keyed map",
+        "Rust canister uses a recipe with a version pin (e.g., @dfinity/rust@v3.2.0)",
+        "Frontend/asset canister uses a recipe with a version pin",
+        "Asset canister recipe includes explicit build commands",
+        "Shows how to start the local network (icp network start -d)"
+      ]
+    },
+    {
+      "name": "Deploy to mainnet",
+      "prompt": "My canisters work locally, how do I get them on mainnet?",
+      "expected_behaviors": [
+        "Uses 'icp deploy -e ic', NOT 'dfx deploy --network ic'",
+        "Mentions cycles are needed",
+        "Mentions canister IDs are stored in .icp/data/ and should be committed to git",
+        "Does NOT use --network ic flag for deployment"
+      ]
+    },
+    {
+      "name": "Migrate from dfx",
+      "prompt": "I have an older IC project that still uses dfx and dfx.json. It has a Motoko backend and a frontend. I want to switch to the new CLI. I also have canisters running on mainnet already.",
+      "expected_behaviors": [
+        "Creates icp.yaml with recipe-based canister configuration",
+        "Motoko canister uses @dfinity/motoko recipe with a version pin",
+        "Asset canister uses @dfinity/asset-canister recipe with a version pin",
+        "Explains identity migration (export from dfx, import into icp)",
+        "Explains canister ID migration via .icp/data/mappings/ic.ids.json",
+        "Uses correct icp identity commands ('icp identity default' not 'icp identity use')"
+      ]
+    }
+  ],
+
+  "trigger_evals": {
+    "description": "Queries to test whether the skill activates correctly. 'should_trigger' queries should cause the skill to load; 'should_not_trigger' queries should NOT activate this skill.",
+    "should_trigger": [
+      "Set up a new Internet Computer project with Rust",
+      "How do I deploy my canister to the local network?",
+      "What's the icp.yaml config for a Motoko canister?",
+      "I'm getting an error with dfx deploy, can you help?",
+      "How do I start the local replica?",
+      "Migrate my dfx.json project to the new CLI",
+      "How do I create a new identity for mainnet deployment?",
+      "What recipes are available for icp-cli?",
+      "My icp deploy is failing with a build error",
+      "How do I check my canister status on mainnet?"
+    ],
+    "should_not_trigger": [
+      "Add access control to my Motoko canister",
+      "How does stable memory work in Rust canisters?",
+      "Implement ICRC-1 token transfer in my canister",
+      "Write a unit test for my Motoko actor",
+      "Set up inter-canister calls between two canisters",
+      "How do I use certified variables?",
+      "Explain the IC consensus mechanism",
+      "Add Internet Identity login to my frontend",
+      "How do I handle canister upgrades safely?",
+      "What's the best way to store large data on-chain?"
+    ]
+  }
+}
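
The two-section layout used by `evaluations/icp-cli.json` can be checked structurally before spending API credits on a run. The following is a minimal sketch, not part of this PR: `checkEvalFile` is a hypothetical helper, and the rules it enforces (non-empty strings, unique eval names, no query listed in both trigger lists) are assumptions inferred from the example file, not the behavior of `scripts/evaluate-skills.js`.

```javascript
// Hypothetical structural check for evaluations/<skill-name>.json.
// Field names follow evaluations/icp-cli.json; the rules are assumptions.
function checkEvalFile(doc) {
  if (doc === null || typeof doc !== "object" || Array.isArray(doc)) {
    return ["root: expected a JSON object"];
  }

  const problems = [];
  const isNonEmptyString = (v) => typeof v === "string" && v.trim().length > 0;
  const isStringList = (v) =>
    Array.isArray(v) && v.length > 0 && v.every(isNonEmptyString);

  if (!isNonEmptyString(doc.skill)) {
    problems.push("skill: expected a non-empty string");
  }

  // output_evals: each case needs a name, a prompt, and judge-checkable behaviors.
  if (!Array.isArray(doc.output_evals) || doc.output_evals.length === 0) {
    problems.push("output_evals: expected a non-empty array");
  } else {
    const seenNames = new Set();
    doc.output_evals.forEach((entry, i) => {
      const ev = entry ?? {};
      const where = `output_evals[${i}]`;
      if (!isNonEmptyString(ev.name)) {
        problems.push(`${where}.name: expected a non-empty string`);
      } else if (seenNames.has(ev.name)) {
        // --eval selects a case by name, so duplicates would be ambiguous.
        problems.push(`${where}.name: duplicate "${ev.name}"`);
      } else {
        seenNames.add(ev.name);
      }
      if (!isNonEmptyString(ev.prompt)) {
        problems.push(`${where}.prompt: expected a non-empty string`);
      }
      if (!isStringList(ev.expected_behaviors)) {
        problems.push(`${where}.expected_behaviors: expected a non-empty list of strings`);
      }
    });
  }

  // trigger_evals: two query lists that should not overlap.
  const triggers = doc.trigger_evals ?? {};
  for (const key of ["should_trigger", "should_not_trigger"]) {
    if (!isStringList(triggers[key])) {
      problems.push(`trigger_evals.${key}: expected a non-empty list of strings`);
    }
  }
  if (Array.isArray(triggers.should_trigger) && Array.isArray(triggers.should_not_trigger)) {
    const positive = new Set(triggers.should_trigger);
    for (const query of triggers.should_not_trigger) {
      if (positive.has(query)) {
        problems.push(`trigger_evals: "${query}" appears in both lists`);
      }
    }
  }

  return problems;
}

// Usage with a trimmed-down copy of the icp-cli cases.
const sample = {
  skill: "icp-cli",
  output_evals: [
    {
      name: "Deploy to mainnet",
      prompt: "My canisters work locally, how do I get them on mainnet?",
      expected_behaviors: ["Mentions cycles are needed"],
    },
  ],
  trigger_evals: {
    should_trigger: ["How do I start the local replica?"],
    should_not_trigger: ["Explain the IC consensus mechanism"],
  },
};

console.log(checkEvalFile(sample)); // prints []
```

To check a real file, parse it first, for example with `JSON.parse(require("node:fs").readFileSync("evaluations/icp-cli.json", "utf8"))`, and pass the result to `checkEvalFile`. An empty array means no structural problems were found under the assumed rules.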