perf: parallelize skill trigger evals across all skills #807
Conversation
## Summary

- Refactor `eval_skill_triggers.ts` to dispatch all queries across all skills into a single global worker pool instead of running skills sequentially. With 25 concurrent workers, all ~555 `claude -p` calls (185 queries × 3 runs) execute in parallel rather than 5-at-a-time per skill serially.
- Remove the now-unused `runSkillEval` and `EvalOutput` abstractions — the three-phase approach (load → dispatch → aggregate) is simpler and faster.
- Bump the default `EVAL_WORKERS` from 5 to 25 to match the global pool model.

## Test plan

- [x] `deno run eval-skill-triggers` — all skills pass, significantly faster
- [x] `--skill <name>` single-skill filter still works
- [x] Results table output unchanged (same format, same `GITHUB_STEP_SUMMARY`)
- [x] `--debug` flag still works (applies to first query only)

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Force-pushed `612614d` to `0f3aa46`
Review: Approve
Clean performance refactoring that replaces per-skill sequential execution with a global worker pool. The three-phase approach (load → dispatch → aggregate) is well-structured and correctly preserves the original aggregation semantics.
What I checked:

- CLAUDE.md compliance (license header, no `any` types, named exports) ✓
- No libswamp import boundary violations (standalone script) ✓
- Concurrency correctness — `pooled()` index management is safe under the JS single-threaded model ✓
- Removed code (`EvalOutput`, `runSkillEval`) is genuinely unused ✓
- Error handling paths preserved (validation errors, SKILL.md parse failures) ✓
- DDD: N/A — eval script, not domain code
Suggestions (non-blocking):

- CI now uses `EVAL_RUNS=1` (was 3), eliminating statistical smoothing. A flaky query now deterministically passes or fails. This is a reasonable speed tradeoff but changes the sensitivity profile — worth revisiting if eval flakiness increases.
- `EVAL_MODEL: "claude-haiku-4-5-20251001"` is a good cost optimization for CI. If Haiku's skill-routing behavior diverges from the default model, eval results may not fully reflect production accuracy.
🤖 Generated with Claude Code
Adversarial Review
Critical / High
None found.
Medium
- **Duplicate query collision in aggregation** — `scripts/eval_skill_triggers.ts:713-718`

  When aggregating results in Phase 3, queries are keyed by `(skillName, query_string)`. If a skill's `trigger_evals.json` contains the same query string twice with different `should_trigger` values, e.g. `[{"query": "check the vault", "should_trigger": true}, {"query": "check the vault", "should_trigger": false}]`, the second entry's `should_trigger` is silently ignored — all trigger results are merged under the first entry's `should_trigger` value, producing an incorrect pass/fail evaluation. `validateEvalSet` does not check for duplicate queries.

  This was also a latent bug in the old `runSkillEval` (same Map-by-query pattern), so it's not a regression, but the refactor was an opportunity to fix it.

  Suggested fix: add a duplicate-query check in `validateEvalSet`, or key the inner map by index or by a `(query, should_trigger)` tuple.
- **`pooled` worker race on `nextIndex`** — `scripts/eval_skill_triggers.ts:489-493`

  The `pooled` function uses a shared `nextIndex` variable incremented by multiple async workers. This is safe today because JS is single-threaded and the read-then-increment of `nextIndex` happens synchronously between `await` points. However, it relies on the implicit invariant that nothing runs between the `while` check and `nextIndex++`. This holds in Deno's current model but is fragile — a future refactor that adds an `await` between the check and the increment would silently break it. Not a bug, just a note.
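One possible shape for the duplicate-query guard suggested in the first item; the entry type and function name here are assumptions for illustration, not the repo's actual code:

```typescript
// Hypothetical guard for validateEvalSet: surface eval sets that reuse a
// query string, since aggregation keys results on (skillName, query_string).
interface TriggerEvalEntry {
  query: string;
  should_trigger: boolean;
}

function findDuplicateQueries(entries: TriggerEvalEntry[]): string[] {
  const seen = new Set<string>();
  const dupes = new Set<string>();
  for (const entry of entries) {
    if (seen.has(entry.query)) dupes.add(entry.query);
    seen.add(entry.query);
  }
  return [...dupes];
}
```

`validateEvalSet` could then fail fast with the offending query strings instead of silently merging their trigger results.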
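The shared-index pool pattern described in the second item can be sketched as follows — a minimal illustration of the technique, not the repo's actual `pooled` implementation:

```typescript
// Minimal worker-pool sketch: a shared nextIndex is claimed synchronously,
// so under JS single-threading no two workers ever grab the same item.
async function pooled<T, R>(
  items: T[],
  workers: number,
  fn: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let nextIndex = 0;
  const worker = async (): Promise<void> => {
    while (nextIndex < items.length) {
      // The check above and this increment run in one synchronous step
      // between awaits; an await inserted between them would break the
      // atomicity of the claim.
      const i = nextIndex++;
      results[i] = await fn(items[i]);
    }
  };
  await Promise.all(
    Array.from({ length: Math.min(workers, items.length) }, () => worker()),
  );
  return results;
}
```

Because each worker only awaits after claiming its index, results land in input order regardless of completion order.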
Low
- **CI runs reduced from 3 to 1** — `.github/workflows/ci.yml:137`

  `EVAL_RUNS` is dropped from 3 to 1. With a single run and a `triggerThreshold` of 0.5, each query result is purely binary — a single LLM response determines pass/fail with zero statistical smoothing. This is a deliberate tradeoff for speed, but worth noting that a flaky eval that previously passed 2 runs out of 3 will now fail roughly a third of the time.
- **`plans` empty → silent success** — `scripts/eval_skill_triggers.ts:692-697`

  If all skills fail validation or SKILL.md parsing (all go to `errorScores` and `plans` is empty), `tasks` will be empty, `pooled` returns `[]`, and the loop over `plans` is a no-op. The `errorScores` correctly set `allPassed = false`, so this doesn't produce a wrong exit code, but the "Dispatching 0 queries" log line could be confusing.
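The sensitivity change from the first Low item can be quantified with illustrative numbers (the per-run probability below is assumed, not measured): with `EVAL_RUNS=3` and `triggerThreshold=0.5` a query passes when at least 2 of 3 runs give the expected trigger decision, while with `EVAL_RUNS=1` its pass rate equals the raw per-run probability.

```typescript
// Probability that a query passes under majority-of-3 aggregation, given a
// per-run probability p of producing the expected trigger decision.
function passProbMajorityOf3(p: number): number {
  // exactly 2 of 3 runs correct, plus all 3 correct
  return 3 * p * p * (1 - p) + p ** 3;
}

const p = 2 / 3; // assumed flaky query: correct on 2 of 3 runs on average
console.log(passProbMajorityOf3(p)); // ~0.74 pass rate with 3 runs
console.log(p);                      // ~0.67 pass rate with a single run
```

So the move to a single run costs such a query a modest amount of pass probability while cutting eval cost to a third.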
Verdict
PASS — The refactor is a straightforward flattening of per-skill sequential execution into a single global pool. The logic for building tasks, dispatching, and re-aggregating results is correct. The pooled concurrency model is safe under JS single-threading. No critical or high issues found.