
perf: parallelize skill trigger evals across all skills #807

Merged
stack72 merged 1 commit into main from speed-up-claude-skill-evals
Mar 21, 2026

Conversation

stack72 (Contributor) commented Mar 21, 2026

Summary

  • Refactor eval_skill_triggers.ts to dispatch all queries across all skills
    into a single global worker pool instead of running skills sequentially.
    With 25 concurrent workers, all ~555 claude -p calls (185 queries × 3 runs)
    execute in parallel rather than 5-at-a-time per skill serially.
  • Remove the now-unused runSkillEval and EvalOutput abstractions — the
    three-phase approach (load → dispatch → aggregate) is simpler and faster.
  • Bump default EVAL_WORKERS from 5 to 25 to match the global pool model.
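The flattening described above can be sketched as follows; the task shape and function name are illustrative assumptions, not the script's actual types:

```typescript
interface EvalTask {
  skillName: string;
  query: string;
  run: number; // 1..EVAL_RUNS
}

// Phase 1 (load): flatten every (skill, query, run) combination into a
// single global task list. Phase 2 then hands this list to one shared
// worker pool instead of draining a small per-skill pool serially.
function buildTasks(
  queriesBySkill: Record<string, string[]>, // illustrative input shape
  runsPerQuery: number,
): EvalTask[] {
  const tasks: EvalTask[] = [];
  for (const [skillName, queries] of Object.entries(queriesBySkill)) {
    for (const query of queries) {
      for (let run = 1; run <= runsPerQuery; run++) {
        tasks.push({ skillName, query, run });
      }
    }
  }
  return tasks;
}

// 185 queries × 3 runs yields the ~555 tasks mentioned above.
```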

Test plan

  • deno run eval-skill-triggers — all skills pass, significantly faster
  • --skill <name> single-skill filter still works
  • Results table output unchanged (same format, same GITHUB_STEP_SUMMARY)
  • --debug flag still works (applies to first query only)

🤖 Generated with Claude Code


stack72 force-pushed the speed-up-claude-skill-evals branch from 612614d to 0f3aa46 on March 21, 2026 at 00:23
github-actions bot left a comment


Review: Approve

Clean performance refactoring that replaces per-skill sequential execution with a global worker pool. The three-phase approach (load → dispatch → aggregate) is well-structured and correctly preserves the original aggregation semantics.

What I checked:

  • CLAUDE.md compliance (license header, no any types, named exports) ✓
  • No libswamp import boundary violations (standalone script) ✓
  • Concurrency correctness — pooled() index management is safe under JS single-threaded model ✓
  • Removed code (EvalOutput, runSkillEval) is genuinely unused ✓
  • Error handling paths preserved (validation errors, SKILL.md parse failures) ✓
  • DDD: N/A — eval script, not domain code

Suggestions (non-blocking):

  • CI now uses EVAL_RUNS=1 (was 3), eliminating statistical smoothing. A flaky query now deterministically passes or fails. This is a reasonable speed tradeoff but changes the sensitivity profile — worth revisiting if eval flakiness increases.
  • EVAL_MODEL: "claude-haiku-4-5-20251001" is a good cost optimization for CI. If Haiku's skill-routing behavior diverges from the default model, eval results may not fully reflect production accuracy.
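For reference, the CI knobs mentioned in these suggestions would live together in the workflow env; this is an assumed sketch of the shape, not the repository's actual ci.yml:

```yaml
# Hypothetical workflow step; the step name and layout are assumptions.
- name: Eval skill triggers
  run: deno run eval-skill-triggers
  env:
    EVAL_WORKERS: "25"  # global pool size (new default)
    EVAL_RUNS: "1"      # was 3; trades statistical smoothing for speed
    EVAL_MODEL: "claude-haiku-4-5-20251001"  # cost optimization for CI
```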

🤖 Generated with Claude Code

github-actions bot left a comment


Adversarial Review

Critical / High

None found.

Medium

  1. Duplicate query collision in aggregation (scripts/eval_skill_triggers.ts:713-718)

    When aggregating results in Phase 3, queries are keyed by (skillName, query_string). If a skill's trigger_evals.json contains the same query string twice with different should_trigger values:

    [
      {"query": "check the vault", "should_trigger": true},
      {"query": "check the vault", "should_trigger": false}
    ]

    …the second entry's should_trigger is silently ignored — all trigger results are merged under the first entry's should_trigger value, producing incorrect pass/fail evaluation. validateEvalSet does not check for duplicate queries.

    This was also a latent bug in the old runSkillEval (same Map-by-query pattern), so it's not a regression, but the refactor was an opportunity to fix it.

    Suggested fix: Add a duplicate-query check in validateEvalSet, or key the inner map by index/(query, should_trigger) tuple.
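    A minimal sketch of the suggested check; the EvalCase shape and function name are illustrative assumptions, not the script's actual API:

```typescript
interface EvalCase {
  query: string;
  should_trigger: boolean;
}

// Collect query strings that appear more than once in an eval set.
// validateEvalSet could reject the set when this is non-empty, instead
// of letting aggregation silently merge entries under the first
// occurrence's should_trigger value.
function findDuplicateQueries(cases: EvalCase[]): string[] {
  const seen = new Set<string>();
  const dupes = new Set<string>();
  for (const c of cases) {
    if (seen.has(c.query)) dupes.add(c.query);
    seen.add(c.query);
  }
  return [...dupes];
}
```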

  2. pooled() worker race on nextIndex (scripts/eval_skill_triggers.ts:489-493)

    The pooled function uses a shared nextIndex variable incremented by multiple async workers. This is safe today because JS is single-threaded and the read-then-increment of nextIndex happens synchronously between await points. However, it relies on an implicit invariant that no microtask or runtime behavior inserts between the while check and nextIndex++. This is correct in Deno's current model but is fragile — a future refactor that adds an await between the check and the increment would silently break. Not a bug, just a note.
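The pattern under discussion looks roughly like this; a minimal sketch, assuming a pooled() helper of this general shape rather than the script's exact implementation:

```typescript
// `concurrency` async workers drain a shared cursor. The read and the
// increment of nextIndex happen synchronously, with no await between
// them, so under JS's single-threaded execution no two workers can
// claim the same index. Inserting an await between the while-check and
// nextIndex++ would break this invariant.
async function pooled<T, R>(
  items: T[],
  concurrency: number,
  run: (item: T) => Promise<R>,
): Promise<R[]> {
  const results = new Array<R>(items.length);
  let nextIndex = 0;
  const worker = async (): Promise<void> => {
    while (nextIndex < items.length) {
      const i = nextIndex++; // synchronous claim: the load-bearing line
      results[i] = await run(items[i]);
    }
  };
  await Promise.all(
    Array.from({ length: Math.min(concurrency, items.length) }, worker),
  );
  return results;
}
```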

Low

  1. CI runs reduced from 3 to 1 (.github/workflows/ci.yml:137)

    EVAL_RUNS is dropped from 3 to 1. With a single run and a triggerThreshold of 0.5, each query result is purely binary — a single LLM response determines pass/fail with zero statistical smoothing. This is a deliberate tradeoff for speed, but worth noting that flaky evals that previously passed 2/3 will now randomly fail 33% of the time.
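The sensitivity change can be quantified. With a 0.5 threshold, a query whose per-run correctness probability is p passes a single run with probability p, and a 3-run eval when at least 2 of 3 runs agree; a quick sketch:

```typescript
// Probability a query passes majority voting over `runs` runs, given
// per-run probability p of the model answering correctly (0.5 threshold).
function majorityPass(p: number, runs: number): number {
  let total = 0;
  for (let k = Math.floor(runs / 2) + 1; k <= runs; k++) {
    total += binomial(runs, k) * p ** k * (1 - p) ** (runs - k);
  }
  return total;
}

function binomial(n: number, k: number): number {
  let result = 1;
  for (let i = 1; i <= k; i++) result = (result * (n - i + 1)) / i;
  return result;
}

// A query the model gets right 2/3 of the time:
console.log(majorityPass(2 / 3, 1).toFixed(3)); // 0.667 (single run)
console.log(majorityPass(2 / 3, 3).toFixed(3)); // 0.741 (majority of 3)
```

So a query that is correct 2/3 of the time per run previously passed roughly 74% of the time under 3-run majority voting, and now passes only 67% of the time, matching the "randomly fail 33% of the time" observation above.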

  2. plans empty → silent success (scripts/eval_skill_triggers.ts:692-697)

    If all skills fail validation or SKILL.md parsing (all go to errorScores, plans is empty), tasks will be empty, pooled returns [], and the loop over plans is a no-op. The errorScores correctly set allPassed = false, so this doesn't produce a wrong exit code. But the "Dispatching 0 queries" log line could be confusing.

Verdict

PASS — The refactor is a straightforward flattening of per-skill sequential execution into a single global pool. The logic for building tasks, dispatching, and re-aggregating results is correct. The pooled concurrency model is safe under JS single-threading. No critical or high issues found.

stack72 merged commit b6c5f88 into main on Mar 21, 2026
8 of 9 checks passed
stack72 deleted the speed-up-claude-skill-evals branch on March 21, 2026 at 00:30