fix: retry MCP callTool on any connection error, not just ErrSessionMissing#2215
fix: retry MCP callTool on any connection error, not just ErrSessionMissing#2215dgageot merged 1 commit intodocker:mainfrom
Conversation
…sionMissing When an MCP server restarts, callTool may receive transport errors (EOF, connection reset, broken pipe) instead of ErrSessionMissing depending on timing. Add an isConnectionError helper that selectively retries on known connection/session error types rather than retrying on all errors indiscriminately. The helper checks for: - mcp.ErrSessionMissing (server lost our session) - io.EOF (connection dropped) - net.Error (timeout, connection reset, etc.) - String-based fallback for transport errors the MCP SDK wraps with %v (losing the original error chain) This fixes flaky TestRemoteReconnectAfterServerRestart without broadening retry to application-level errors. Assisted-By: docker-agent
There was a problem hiding this comment.
Assessment: 🟡 NEEDS ATTENTION
Found 3 medium-severity issues in the new isConnectionError function. These are logic concerns that could cause false-positive reconnection attempts or miss actual connection errors in edge cases. The core fix for the flaky test is sound, but the error detection logic could be more robust.
| if msg := err.Error(); strings.Contains(msg, "connection reset") || | ||
| strings.Contains(msg, "connection refused") || | ||
| strings.Contains(msg, "broken pipe") || | ||
| strings.Contains(msg, "EOF") { |
There was a problem hiding this comment.
🟡 MEDIUM: Redundant EOF string matching could cause false positives
The isConnectionError function checks for EOF twice:
- Line 709:
errors.Is(err, io.EOF)(proper error chain check) - Line 714:
strings.Contains(msg, "EOF")(string matching)
If an error is io.EOF, it's already caught by line 709, making the string check redundant. More critically, the substring check could match unrelated errors that mention 'EOF' in their message (e.g., 'invalid EOF character in JSON', 'unexpected EOF in response body'), causing false-positive reconnection attempts.
Recommendation: Remove "EOF" from the string matching on line 714, since actual io.EOF errors are already handled by line 709.
| return true | ||
| } | ||
| var netErr net.Error | ||
| if errors.As(err, &netErr) { |
There was a problem hiding this comment.
🟡 MEDIUM: All net.Error types treated as retriable connection errors
The code treats any net.Error as a retriable connection error (lines 704-706):
if errors.As(err, &netErr) {
return true
}However, some net.Error implementations represent permanent failures that won't be fixed by retrying:
- DNS lookup failures for non-existent hosts (
no such host) - Invalid address formats
- Network unreachable errors (may indicate configuration issues)
Recommendation: Consider checking netErr.Temporary() or filtering for specific retriable error types to avoid unnecessary retry attempts on permanent errors.
| // client.Do) with its internal ErrRejected sentinel using %v, which | ||
| // drops the original error from the chain. Detect these by checking | ||
| // the error message for common transport-failure substrings. | ||
| if msg := err.Error(); strings.Contains(msg, "connection reset") || |
There was a problem hiding this comment.
🟡 MEDIUM: String-based error detection is brittle and locale-dependent
The string-matching logic (lines 711-714) checks for 'connection reset', 'connection refused', 'broken pipe', and 'EOF'. While the comment explains this is needed because the MCP SDK wraps errors with %v, this approach has several risks:
- Not stable: Error messages are not part of Go's compatibility guarantee and can change across versions or operating systems
- Locale-dependent: Non-English systems may have different error messages
- Phrasing variations: Different libraries might phrase errors differently (e.g., 'reset by peer' vs 'connection reset')
Recommendation: While this is a documented workaround for upstream SDK limitations, consider:
- Filing an issue with the MCP SDK to preserve error chains
- Adding test coverage for the specific error messages you expect
- Documenting the known limitation in code comments (already partially done)
When an MCP server restarts, callTool could receive transport errors (EOF, connection reset) instead of ErrSessionMissing depending on timing. Broaden the retry condition to trigger forceReconnectAndWait on any non-context-cancellation error, fixing a flaky TestRemoteReconnectAfterServerRestart.
Assisted-By: docker-agent