perf: parallelize skill trigger evals across all skills #807
Conversation
## Summary

- Refactor `eval_skill_triggers.ts` to dispatch all queries across all skills into a single global worker pool instead of running skills sequentially. With 25 concurrent workers, all ~555 `claude -p` calls (185 queries × 3 runs) execute in parallel rather than 5-at-a-time per skill serially.
- Remove the now-unused `runSkillEval` and `EvalOutput` abstractions — the three-phase approach (load → dispatch → aggregate) is simpler and faster.
- Bump the default `EVAL_WORKERS` from 5 to 25 to match the global pool model.

## Test plan

- [x] `deno run eval-skill-triggers` — all skills pass, significantly faster
- [x] `--skill <name>` single-skill filter still works
- [x] Results table output unchanged (same format, same `GITHUB_STEP_SUMMARY`)
- [x] `--debug` flag still works (applies to first query only)

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Force-pushed `612614d` to `0f3aa46`
Review: Approve
Clean performance refactoring that replaces per-skill sequential execution with a global worker pool. The three-phase approach (load → dispatch → aggregate) is well-structured and correctly preserves the original aggregation semantics.
What I checked:

- CLAUDE.md compliance (license header, no `any` types, named exports) ✓
- No libswamp import boundary violations (standalone script) ✓
- Concurrency correctness — `pooled()` index management is safe under the JS single-threaded model ✓
- Removed code (`EvalOutput`, `runSkillEval`) is genuinely unused ✓
- Error handling paths preserved (validation errors, SKILL.md parse failures) ✓
- DDD: N/A — eval script, not domain code
Suggestions (non-blocking):

- CI now uses `EVAL_RUNS=1` (was 3), eliminating statistical smoothing. A flaky query now deterministically passes or fails. This is a reasonable speed tradeoff but changes the sensitivity profile — worth revisiting if eval flakiness increases.
- `EVAL_MODEL: "claude-haiku-4-5-20251001"` is a good cost optimization for CI. If Haiku's skill-routing behavior diverges from the default model, eval results may not fully reflect production accuracy.
🤖 Generated with Claude Code
Adversarial Review
Critical / High
None found.
Medium
- **Duplicate query collision in aggregation** — `scripts/eval_skill_triggers.ts:713-718`

  When aggregating results in Phase 3, queries are keyed by `(skillName, query_string)`. If a skill's `trigger_evals.json` contains the same query string twice with different `should_trigger` values, e.g. `[{"query": "check the vault", "should_trigger": true}, {"query": "check the vault", "should_trigger": false}]`, the second entry's `should_trigger` is silently ignored — all trigger results are merged under the first entry's `should_trigger` value, producing an incorrect pass/fail evaluation. `validateEvalSet` does not check for duplicate queries.

  This was also a latent bug in the old `runSkillEval` (same Map-by-query pattern), so it's not a regression, but the refactor was an opportunity to fix it.

  Suggested fix: add a duplicate-query check in `validateEvalSet`, or key the inner map by index or by a `(query, should_trigger)` tuple.
- **`pooled` worker race on `nextIndex`** — `scripts/eval_skill_triggers.ts:489-493`

  The `pooled` function uses a shared `nextIndex` variable incremented by multiple async workers. This is safe today because JS is single-threaded and the read-then-increment of `nextIndex` happens synchronously between `await` points. However, it relies on the implicit invariant that nothing runs between the `while` check and `nextIndex++`. This holds in Deno's current model but is fragile — a future refactor that adds an `await` between the check and the increment would silently break it. Not a bug, just a note.
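One possible shape for the duplicate-query guard suggested in the first item; the entry type and function name here are assumptions for illustration, not the repo's actual code:

```typescript
// Hypothetical guard for validateEvalSet: surface eval sets that reuse a
// query string, since aggregation keys results on (skillName, query_string).
interface TriggerEvalEntry {
  query: string;
  should_trigger: boolean;
}

function findDuplicateQueries(entries: TriggerEvalEntry[]): string[] {
  const seen = new Set<string>();
  const dupes = new Set<string>();
  for (const entry of entries) {
    if (seen.has(entry.query)) dupes.add(entry.query);
    seen.add(entry.query);
  }
  return [...dupes];
}
```

`validateEvalSet` could then fail fast with the offending query strings instead of silently merging their trigger results.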
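The shared-index pool pattern described in the second item can be sketched as follows — a minimal illustration of the technique, not the repo's actual `pooled` implementation:

```typescript
// Minimal worker-pool sketch: a shared nextIndex is claimed synchronously,
// so under JS single-threading no two workers ever grab the same item.
async function pooled<T, R>(
  items: T[],
  workers: number,
  fn: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let nextIndex = 0;
  const worker = async (): Promise<void> => {
    while (nextIndex < items.length) {
      // The check above and this increment run in one synchronous step
      // between awaits; an await inserted between them would break the
      // atomicity of the claim.
      const i = nextIndex++;
      results[i] = await fn(items[i]);
    }
  };
  await Promise.all(
    Array.from({ length: Math.min(workers, items.length) }, () => worker()),
  );
  return results;
}
```

Because each worker only awaits after claiming its index, results land in input order regardless of completion order.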
Low
- **CI runs reduced from 3 to 1** — `.github/workflows/ci.yml:137`

  `EVAL_RUNS` is dropped from 3 to 1. With a single run and a `triggerThreshold` of 0.5, each query result is purely binary — a single LLM response determines pass/fail with zero statistical smoothing. This is a deliberate tradeoff for speed, but worth noting that a flaky eval that previously passed 2 runs out of 3 will now fail roughly a third of the time.
- **`plans` empty → silent success** — `scripts/eval_skill_triggers.ts:692-697`

  If all skills fail validation or SKILL.md parsing (all go to `errorScores` and `plans` is empty), `tasks` will be empty, `pooled` returns `[]`, and the loop over `plans` is a no-op. The `errorScores` correctly set `allPassed = false`, so this doesn't produce a wrong exit code, but the "Dispatching 0 queries" log line could be confusing.
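The sensitivity change from the first Low item can be quantified with illustrative numbers (the per-run probability below is assumed, not measured): with `EVAL_RUNS=3` and `triggerThreshold=0.5` a query passes when at least 2 of 3 runs give the expected trigger decision, while with `EVAL_RUNS=1` its pass rate equals the raw per-run probability.

```typescript
// Probability that a query passes under majority-of-3 aggregation, given a
// per-run probability p of producing the expected trigger decision.
function passProbMajorityOf3(p: number): number {
  // exactly 2 of 3 runs correct, plus all 3 correct
  return 3 * p * p * (1 - p) + p ** 3;
}

const p = 2 / 3; // assumed flaky query: correct on 2 of 3 runs on average
console.log(passProbMajorityOf3(p)); // ~0.74 pass rate with 3 runs
console.log(p);                      // ~0.67 pass rate with a single run
```

So the move to a single run costs such a query a modest amount of pass probability while cutting eval cost to a third.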
Verdict
PASS — The refactor is a straightforward flattening of per-skill sequential execution into a single global pool. The logic for building tasks, dispatching, and re-aggregating results is correct. The pooled concurrency model is safe under JS single-threading. No critical or high issues found.