Skip to content

feat: add agent coordination hooks to prevent duplicate work#703

Merged
geoffjay merged 2 commits intofeature/autonomous-pipelinefrom
issue-657
Mar 23, 2026
Merged

feat: add agent coordination hooks to prevent duplicate work#703
geoffjay merged 2 commits intofeature/autonomous-pipelinefrom
issue-657

Conversation

@geoffjay
Copy link
Owner

feat: add agent coordination hooks to prevent duplicate work

feat(hooks): add agent coordination hook to prevent duplicate work

Add .claude/hooks/check-coordination.py, a PreToolUse hook that intercepts
gh issue create, gh issue close/reopen, and agent communicate announcements
commands to enforce coordination between concurrent agents.

The hook:

  • Checks the engineering communicate room for active [LOCK] messages before
    allowing issue creation or closure (blocks if a lock is active)
  • Performs a fuzzy duplicate search (via gh issue list --search) before any
    gh issue create and injects a warning if similar open issues exist
  • Warns before posting to the announcements room if a recent message was
    already posted on the same topic
  • Fails open in < 2 s if the communicate service is unreachable

Add the hook to .claude/settings.json as a PreToolUse/Bash hook with an 8 s
timeout.

Add lock/unlock protocol documentation to the system prompts of the four
agents most likely to perform bulk mutations: planner, worker, conductor, and
triage. Each prompt now explains when to acquire a [LOCK], how to release it
with [UNLOCK], and that the hook enforces these signals automatically.

Closes #657

Co-Authored-By: Claude Sonnet 4.6 noreply@anthropic.com

@geoffjay geoffjay added the review-agent Used to invoke a review by an agent tracking this label label Mar 22, 2026
@geoffjay
Copy link
Owner Author

geoffjay commented Mar 22, 2026

This change is part of the following stack:

Change managed by git-spice.

Copy link
Owner Author

@geoffjay geoffjay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review: feat(hooks): agent coordination checkpoint

Good concept and solid overall structure — this directly addresses the v0.9.0 duplication incident where 3 agents independently created 52 overlapping issues. The fail-open design, 8 s timeout, and lock/unlock protocol are all the right instincts. The agent YAML additions are clear and consistent.

Two blocking bugs need fixing before merge, plus a few non-blocking suggestions below.


🔴 Blocking: f-string missing on warn message in handle_issue_close

In handle_issue_close (check-coordination.py ~line 282), the first string literal in the warn() call has an f prefix but the continuation strings do not. Python does not automatically promote adjacent string literals to f-strings — only the first segment is interpolated. The second and third strings will render literally as #{kept_number} and #{issue_number} in the agent's context, making the warning completely useless.

Fix — add f to every continuation string:

warn(
    f"⚠️  Closing issue #{issue_number} as duplicate of #{kept_number}.\n\n"
    f"Before proceeding, verify that the issue being KEPT (#{kept_number}) "
    f"is at least as detailed as the one being closed (#{issue_number}). "
    "If the issue you are closing is more detailed, consider merging its "
    "content into the kept issue first."
)

🔴 Blocking: check_active_lock doesn't handle concurrent locks correctly

check_active_lock (check-coordination.py ~line 197) tracks a single active_lock variable. If Agent A posts LOCK, then Agent B posts LOCK, then Agent A posts UNLOCK — the function clears active_lock entirely (because any UNLOCK clears the single slot), even though Agent B's lock is still live. A third agent running gh issue create will see no active lock and proceed, violating Agent B's lock.

This is exactly the scenario that can arise during a busy pipeline run with planner and conductor both doing bulk ops simultaneously — the very situation this PR is meant to prevent.

Minimal fix — track per-agent locks by keying on the actor/sender field:

def check_active_lock(messages: list[dict]) -> str | None:
    active_locks: dict[str, str] = {}  # actor -> lock message
    for msg in messages:
        content = msg.get("content") or msg.get("body") or msg.get("text") or ""
        actor = msg.get("sender") or msg.get("actor") or "unknown"
        if LOCK_MARKER_RE.search(content):
            active_locks[actor] = content.strip()
        elif UNLOCK_MARKER_RE.search(content):
            active_locks.pop(actor, None)
    # Return the first still-active lock, if any
    return next(iter(active_locks.values()), None)

This assumes each actor uses consistent sender metadata, which the communicate service provides. If actor identity in messages is unreliable, a counter-based approach (increment on LOCK, decrement on UNLOCK, block when > 0) is a safe fallback.


🟡 Non-blocking suggestions

1. gh issue reopen routed to handle_issue_close (check-coordination.py, routing in main())
The duplicate-check logic in handle_issue_close (looking for closes #N, duplicate of, etc.) is only meaningful for close operations. Routing reopen through the same handler runs that check against a command that will never contain those patterns — dead code that could confuse future maintainers. Consider a separate handle_issue_reopen that only does the lock check and then calls allow().

2. No unit tests
392 lines of Python with non-trivial branching and no unit tests is a gap, especially for a safety mechanism. At minimum check_active_lock, _extract_title, and the main() routing logic are worth covering. A tests/test_check_coordination.py with a few pytest cases would give confidence that the lock semantics are correct across concurrent scenarios.

3. get_room_id fetches only 100 rooms
_http_get("/rooms?limit=100&offset=0") — if the communicate service has more than 100 rooms this silently fails to find engineering and the hook fails open without any indication. A name-based query endpoint (if available) or a larger limit would be more robust.

4. Lock window undocumented as design decision
The 30-minute LOCK_WINDOW_MINUTES means a lock placed before a long conductor sync (> 30 min) will silently expire. This is probably intentional (stale locks shouldn't block forever) but it is not called out in the module docstring. A one-liner explaining the trade-off would help the next person who hits a mysterious bypass.

5. Unnecessary f prefix (trivial)
_http_get(f"/rooms?limit=100&offset=0") — no interpolation, the f is a no-op.


Stack note: This PR stacks directly on feature/autonomous-pipeline — no parent open PRs to worry about.

@geoffjay geoffjay added the needs-rework PR has review feedback that must be addressed before merging label Mar 23, 2026
Blocking fixes:
- Fix check_active_lock() to track locks per-actor using a dict keyed on
  the message sender field. Previously a single active_lock variable meant
  Agent A's [UNLOCK] would clear Agent B's still-active lock, allowing a
  third agent to bypass B's hold. Now each actor's lock/unlock pair is
  tracked independently.
- Fix f-string missing on continuation strings in handle_issue_close():
  the kept_number and issue_number variables were rendered literally as
  #{kept_number} / #{issue_number} in the warn() message. Added f prefix
  to all strings that reference those variables.

Non-blocking fixes:
- Add handle_issue_reopen() — a dedicated handler that runs only the lock
  check (no duplicate-detection logic). Update main() routing so
  gh issue reopen dispatches there instead of handle_issue_close().
- Remove spurious f prefix from _http_get("/rooms?limit=100&offset=0")
  (no interpolation was happening).
- Document the 30-minute LOCK_WINDOW_MINUTES trade-off in the module
  docstring: stale locks expire intentionally, agents doing extended work
  should post periodic [LOCK] refreshes.
- Add .claude/tests/test_check_coordination.py with 17 unittest cases
  covering check_active_lock() (concurrent locks, case insensitivity,
  ordering), _extract_title(), and main() dispatch routing. Runnable
  with stdlib unittest (no pytest required).
@geoffjay geoffjay merged commit 6eaa65b into feature/autonomous-pipeline Mar 23, 2026
@geoffjay geoffjay deleted the issue-657 branch March 23, 2026 03:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

needs-rework PR has review feedback that must be addressed before merging review-agent Used to invoke a review by an agent tracking this label

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant