feat: add agent coordination hooks to prevent duplicate work #703
geoffjay merged 2 commits into feature/autonomous-pipeline
Conversation
This change is part of the following stack: Change managed by git-spice.
geoffjay
left a comment
Review: feat(hooks): agent coordination checkpoint
Good concept and solid overall structure — this directly addresses the v0.9.0 duplication incident where 3 agents independently created 52 overlapping issues. The fail-open design, 8 s timeout, and lock/unlock protocol are all the right instincts. The agent YAML additions are clear and consistent.
Two blocking bugs need fixing before merge, plus a few non-blocking suggestions below.
🔴 Blocking: f-string missing on warn message in handle_issue_close
In handle_issue_close (check-coordination.py ~line 282), the first string literal in the warn() call has an f prefix but the continuation strings do not. Python does not automatically promote adjacent string literals to f-strings — only the first segment is interpolated. The second and third strings will render literally as #{kept_number} and #{issue_number} in the agent's context, making the warning completely useless.
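To see the behavior in isolation (a minimal standalone sketch, with `kept_number` standing in for the hook's real variable):

```python
kept_number = 42

# Adjacent string literals are concatenated at compile time,
# but only the f-prefixed segment is interpolated.
msg = (
    f"keeping #{kept_number}, "
    "duplicate of #{kept_number}"
)
print(msg)  # keeping #42, duplicate of #{kept_number}
```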
Fix — add f to every continuation string:
```python
warn(
    f"⚠️ Closing issue #{issue_number} as duplicate of #{kept_number}.\n\n"
    f"Before proceeding, verify that the issue being KEPT (#{kept_number}) "
    f"is at least as detailed as the one being closed (#{issue_number}). "
    "If the issue you are closing is more detailed, consider merging its "
    "content into the kept issue first."
)
```

🔴 Blocking: check_active_lock doesn't handle concurrent locks correctly
check_active_lock (check-coordination.py ~line 197) tracks a single active_lock variable. If Agent A posts LOCK, then Agent B posts LOCK, then Agent A posts UNLOCK — the function clears active_lock entirely (because any UNLOCK clears the single slot), even though Agent B's lock is still live. A third agent running gh issue create will see no active lock and proceed, violating Agent B's lock.
This is exactly the scenario that can arise during a busy pipeline run with planner and conductor both doing bulk ops simultaneously — the very situation this PR is meant to prevent.
Minimal fix — track per-agent locks by keying on the actor/sender field:
```python
def check_active_lock(messages: list[dict]) -> str | None:
    active_locks: dict[str, str] = {}  # actor -> lock message
    for msg in messages:
        content = msg.get("content") or msg.get("body") or msg.get("text") or ""
        actor = msg.get("sender") or msg.get("actor") or "unknown"
        if LOCK_MARKER_RE.search(content):
            active_locks[actor] = content.strip()
        elif UNLOCK_MARKER_RE.search(content):
            active_locks.pop(actor, None)
    # Return the first still-active lock, if any
    return next(iter(active_locks.values()), None)
```

This assumes each actor uses consistent sender metadata, which the communicate service provides. If actor identity in messages is unreliable, a counter-based approach (increment on LOCK, decrement on UNLOCK, block when > 0) is a safe fallback.
🟡 Non-blocking suggestions
1. gh issue reopen routed to handle_issue_close (check-coordination.py, routing in main())
The duplicate-check logic in handle_issue_close (looking for closes #N, duplicate of, etc.) is only meaningful for close operations. Routing reopen through the same handler runs that check against a command that will never contain those patterns — dead code that could confuse future maintainers. Consider a separate handle_issue_reopen that only does the lock check and then calls allow().
2. No unit tests
392 lines of Python with non-trivial branching and no unit tests is a gap, especially for a safety mechanism. At minimum check_active_lock, _extract_title, and the main() routing logic are worth covering. A tests/test_check_coordination.py with a few pytest cases would give confidence that the lock semantics are correct across concurrent scenarios.
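The concurrent-lock scenario from the blocking bug above is exactly the kind of case worth pinning down; a stdlib unittest sketch (with the per-actor logic reimplemented inline rather than imported from the hook):

```python
import unittest

def check_active_lock(messages):
    # Per-actor lock tracking (mirrors the fix proposed above)
    active = {}
    for msg in messages:
        content = msg.get("content", "")
        actor = msg.get("sender", "unknown")
        if "[UNLOCK]" in content:
            active.pop(actor, None)
        elif "[LOCK]" in content:
            active[actor] = content
    return next(iter(active.values()), None)

class ConcurrentLockTest(unittest.TestCase):
    def test_unlock_clears_only_own_lock(self):
        msgs = [
            {"sender": "A", "content": "[LOCK] planner: bulk issue creation"},
            {"sender": "B", "content": "[LOCK] triage: dedup sweep"},
            {"sender": "A", "content": "[UNLOCK]"},
        ]
        # B's lock must survive A's unlock
        self.assertEqual(check_active_lock(msgs), "[LOCK] triage: dedup sweep")

    def test_all_locks_released(self):
        msgs = [
            {"sender": "A", "content": "[LOCK] x"},
            {"sender": "A", "content": "[UNLOCK]"},
        ]
        self.assertIsNone(check_active_lock(msgs))

if __name__ == "__main__":
    unittest.main()
```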
3. get_room_id fetches only 100 rooms
_http_get("/rooms?limit=100&offset=0") — if the communicate service has more than 100 rooms this silently fails to find engineering and the hook fails open without any indication. A name-based query endpoint (if available) or a larger limit would be more robust.
4. Lock window undocumented as design decision
The 30-minute LOCK_WINDOW_MINUTES means a lock placed before a long conductor sync (> 30 min) will silently expire. This is probably intentional (stale locks shouldn't block forever) but it is not called out in the module docstring. A one-liner explaining the trade-off would help the next person who hits a mysterious bypass.
5. Unnecessary f prefix (trivial)
_http_get(f"/rooms?limit=100&offset=0") — no interpolation, the f is a no-op.
Stack note: This PR stacks directly on feature/autonomous-pipeline — no parent open PRs to worry about.
Blocking fixes:
- Fix check_active_lock() to track locks per-actor using a dict keyed on
the message sender field. Previously a single active_lock variable meant
Agent A's [UNLOCK] would clear Agent B's still-active lock, allowing a
third agent to bypass B's hold. Now each actor's lock/unlock pair is
tracked independently.
- Fix f-string missing on continuation strings in handle_issue_close():
the kept_number and issue_number variables were rendered literally as
#{kept_number} / #{issue_number} in the warn() message. Added f prefix
to all strings that reference those variables.
Non-blocking fixes:
- Add handle_issue_reopen() — a dedicated handler that runs only the lock
check (no duplicate-detection logic). Update main() routing so
gh issue reopen dispatches there instead of handle_issue_close().
- Remove spurious f prefix from _http_get("/rooms?limit=100&offset=0")
(no interpolation was happening).
- Document the 30-minute LOCK_WINDOW_MINUTES trade-off in the module
docstring: stale locks expire intentionally, agents doing extended work
should post periodic [LOCK] refreshes.
- Add .claude/tests/test_check_coordination.py with 17 unittest cases
covering check_active_lock() (concurrent locks, case insensitivity,
ordering), _extract_title(), and main() dispatch routing. Runnable
with stdlib unittest (no pytest required).
feat(hooks): add agent coordination hook to prevent duplicate work
Add .claude/hooks/check-coordination.py, a PreToolUse hook that intercepts
gh issue create, gh issue close/reopen, and agent communicate announcement
commands to enforce coordination between concurrent agents.
The hook:
- checks for an active [LOCK] before allowing issue creation or closure (blocks if a lock is active)
- inspects gh issue create and injects a warning if similar open issues exist
- warns before announcements when a message has already been posted on the same topic
Add the hook to .claude/settings.json as a PreToolUse/Bash hook with an 8 s
timeout.
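The settings entry might look roughly like this (a sketch of the PreToolUse/Bash registration; the exact schema depends on the Claude Code settings version in use):

```json
{
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "Bash",
        "hooks": [
          {
            "type": "command",
            "command": ".claude/hooks/check-coordination.py",
            "timeout": 8
          }
        ]
      }
    ]
  }
}
```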
Add lock/unlock protocol documentation to the system prompts of the four
agents most likely to perform bulk mutations: planner, worker, conductor, and
triage. Each prompt now explains when to acquire a [LOCK], how to release it
with [UNLOCK], and that the hook enforces these signals automatically.
Closes #657
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>