fix(#1743): retry primary model on transient errors even without fallback models #1744

Draft
aheritier wants to merge 1 commit into docker:main from aheritier:fix/retry-without-fallback-models

@aheritier
Contributor

Summary

Fixes #1743

getEffectiveRetries returned 0 when no fallback models were configured, so retryable errors (5xx, timeouts) were never retried. As a result, an Anthropic streaming "Internal server error" surfaced immediately as "all models failed" instead of being retried with backoff.

Root Cause

The getEffectiveRetries function conflated "no fallback models to fall back to" with "no need to retry the same model":

// Before: returns 0 when no fallback models configured
if retries == 0 && len(a.FallbackModels()) > 0 {
    return DefaultFallbackRetries
}
return retries // returns 0

With maxAttempts = 1 + 0 = 1, even correctly classified retryable errors got exactly one attempt.

Fix

Changed getEffectiveRetries to always return DefaultFallbackRetries (2 retries = 3 total attempts) when no explicit retry count is configured, regardless of whether fallback models exist:

// After: always returns DefaultFallbackRetries when not explicitly configured
if retries == 0 {
    return DefaultFallbackRetries
}
return retries
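
The effect of the change on the attempt count can be checked with a small standalone sketch. The function names, the boolean hasFallbacks parameter, and the lowercase constant below are assumptions made for illustration; only the two conditions and the value 2 for DefaultFallbackRetries come from this PR.

```go
package main

import "fmt"

// defaultFallbackRetries mirrors the value the PR states for
// DefaultFallbackRetries (2 retries = 3 total attempts).
const defaultFallbackRetries = 2

// effectiveRetriesBefore is a hypothetical stand-in for the old logic:
// the default only applied when fallback models were configured.
func effectiveRetriesBefore(retries int, hasFallbacks bool) int {
	if retries == 0 && hasFallbacks {
		return defaultFallbackRetries
	}
	return retries
}

// effectiveRetriesAfter is a hypothetical stand-in for the new logic:
// the default applies whenever no explicit retry count is set.
func effectiveRetriesAfter(retries int) int {
	if retries == 0 {
		return defaultFallbackRetries
	}
	return retries
}

func main() {
	for _, hasFallbacks := range []bool{false, true} {
		before := 1 + effectiveRetriesBefore(0, hasFallbacks)
		after := 1 + effectiveRetriesAfter(0)
		fmt.Printf("fallbacks=%v maxAttempts before=%d after=%d\n",
			hasFallbacks, before, after)
	}
}
```

One consequence visible in this sketch is that a retry count of 0 is treated as "unset", so 0 alone cannot express "never retry". How the real configuration expresses that case is not shown in this PR.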

Tests

  • Updated TestGetEffectiveRetries to expect DefaultFallbackRetries for agents without fallback models
  • Added TestPrimaryRetriesWithoutFallbackModels, a regression test that reproduces the exact scenario (Anthropic streaming "Internal server error", no fallback models) and verifies retry and recovery
  • Added an isRetryableModelError test case for the Anthropic streaming error format

…t fallback models

getEffectiveRetries returned 0 when no fallback models were configured,
meaning retryable errors (5xx, timeouts) got zero retries. This caused
Anthropic streaming 'Internal server error' to immediately surface as
'all models failed' instead of being retried with backoff.

Changed getEffectiveRetries to always return DefaultFallbackRetries (2)
when no explicit retry count is configured, regardless of whether
fallback models exist. Retrying the same model on transient errors is
always valuable.

Fixes docker#1743

Assisted-By: cagent
@aheritier aheritier force-pushed the fix/retry-without-fallback-models branch from 881b4c7 to b951002 Compare February 16, 2026 08:57
@aheritier aheritier changed the title fix: retry primary model on transient errors even without fallback models fix(#1743): retry primary model on transient errors even without fallback models Feb 16, 2026

Development

Successfully merging this pull request may close these issues.

Retryable streaming errors not retried when no fallback models configured
