autoresume state handling improvements #2196

Open
matthewlouisbrockman wants to merge 30 commits into main from connect-autoresume-state-handling

Conversation

@matthewlouisbrockman
Contributor

@matthewlouisbrockman matthewlouisbrockman commented Mar 21, 2026

Auto-resume did not mirror the state handling used by /resume and /connect. On a catalog miss, it could return routing for sandboxes still in a pausing/snapshotting transition, which let post-connect run traffic hit 502s.

Changes:

  • The orchestrator now has HandleExistingSandboxAutoResume in autoresume.go to handle these state transitions.

Note

Medium Risk
Touches auto-resume/resume routing and orchestrator state handling, which can change when requests fall back to DB-backed resumes or return 409s; misclassification could increase resume failures or transient errors.

Overview
Auto-resume now checks orchestrator state before attempting DB-backed resume: running sandboxes return routing immediately, pausing/snapshotting sandboxes are waited on and re-checked (bounded retries/time budget), killing/unknown states fail, and snapshot metadata is reloaded before fallback to avoid resuming from stale pre-pause data. A new shared SandboxStillTransitioningMessage is returned as FailedPrecondition and propagated through client-proxy into a dedicated SandboxStillTransitioningError, which the HTTP proxy renders as a 409 (JSON or browser HTML). Added tests cover the new state-machine and error-template behavior.

Written by Cursor Bugbot for commit 2eb2308. This will update automatically on new commits.

@matthewlouisbrockman
Contributor Author

@codex review

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: aa2c4735a8

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Autofix Details

Bugbot Autofix prepared a fix for the issue found in the latest run.

  • ✅ Fixed: Tight spin loop on failed snapshot transition
    • Added a retry backoff when sandbox state remains snapshotting after refresh so auto-resume no longer busy-spins on failed snapshot transitions.

Preview (8d53281ca5)
diff --git a/packages/api/internal/handlers/proxy_grpc.go b/packages/api/internal/handlers/proxy_grpc.go
--- a/packages/api/internal/handlers/proxy_grpc.go
+++ b/packages/api/internal/handlers/proxy_grpc.go
@@ -24,6 +24,8 @@
 	sharedutils "github.com/e2b-dev/infra/packages/shared/pkg/utils"
 )
 
+const snapshottingStateRetryDelay = 100 * time.Millisecond
+
 type SandboxService struct {
 	proxygrpc.UnimplementedSandboxServiceServer
 
@@ -134,6 +136,14 @@
 
 			updatedSandbox, getSandboxErr := getSandbox(ctx)
 			if getSandboxErr == nil {
+				if sbx.State == sandbox.StateSnapshotting && updatedSandbox.State == sandbox.StateSnapshotting {
+					select {
+					case <-time.After(snapshottingStateRetryDelay):
+					case <-ctx.Done():
+						return "", false, status.Error(codes.Internal, "error waiting for sandbox snapshot to finish")
+					}
+				}
+
 				sbx = updatedSandbox
 
 				continue

diff --git a/packages/api/internal/handlers/proxy_grpc_test.go b/packages/api/internal/handlers/proxy_grpc_test.go
--- a/packages/api/internal/handlers/proxy_grpc_test.go
+++ b/packages/api/internal/handlers/proxy_grpc_test.go
@@ -5,6 +5,7 @@
 	"errors"
 	"fmt"
 	"testing"
+	"time"
 
 	"github.com/google/uuid"
 	"github.com/stretchr/testify/assert"
@@ -404,6 +405,45 @@
 		assert.Equal(t, "error waiting for sandbox snapshot to finish", st.Message())
 	})
 
+	t.Run("snapshotting sandbox with no active transition does not spin", func(t *testing.T) {
+		t.Parallel()
+
+		ctx, cancel := context.WithTimeout(t.Context(), 20*time.Millisecond)
+		defer cancel()
+
+		waitCalls := 0
+		getCalls := 0
+
+		_, handled, err := handleExistingSandboxAutoResume(
+			ctx,
+			"test-sandbox",
+			testSandboxForAutoResume(sandbox.StateSnapshotting),
+			func(context.Context) error {
+				waitCalls++
+
+				return nil
+			},
+			func(context.Context) (sandbox.Sandbox, error) {
+				getCalls++
+
+				return testSandboxForAutoResume(sandbox.StateSnapshotting), nil
+			},
+			func(sandbox.Sandbox) (string, error) {
+				t.Fatal("getNodeIP should not be called while sandbox remains snapshotting")
+
+				return "", nil
+			},
+		)
+		require.Error(t, err)
+		assert.False(t, handled)
+		st, ok := status.FromError(err)
+		require.True(t, ok)
+		assert.Equal(t, codes.Internal, st.Code())
+		assert.Equal(t, "error waiting for sandbox snapshot to finish", st.Message())
+		assert.Equal(t, 1, waitCalls)
+		assert.Equal(t, 1, getCalls)
+	})
+
 	t.Run("pausing sandbox returns internal error when refreshed sandbox lookup fails unexpectedly", func(t *testing.T) {
 		t.Parallel()

@matthewlouisbrockman matthewlouisbrockman marked this pull request as ready for review March 22, 2026 08:08
@matthewlouisbrockman matthewlouisbrockman assigned dobrac and unassigned levb Mar 22, 2026

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 964c63291e

@matthewlouisbrockman matthewlouisbrockman marked this pull request as draft March 22, 2026 08:15
@matthewlouisbrockman matthewlouisbrockman marked this pull request as ready for review March 22, 2026 09:15

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 99c73584aa

@jakubno jakubno assigned jakubno and unassigned dobrac Mar 23, 2026
Member

@jakubno jakubno left a comment

just NITs

cursor bot commented Mar 24, 2026

PR Summary

Medium Risk
Changes auto-resume control flow and error propagation across API, orchestrator, and proxies, which could alter client-visible behavior during sandbox transitions. Risk is moderated by added unit/integration tests for the new transition and error-handling paths.

Overview
Improves sandbox auto-resume to avoid returning routing information while a sandbox is still pausing/snapshotting by adding orchestrator-side state handling with bounded waiting/retries, reloading snapshot metadata before resume, and propagating a dedicated FailedPrecondition (sandbox is still transitioning) signal end-to-end so the client proxy and shared proxy can surface a 409 "still transitioning" response (JSON or browser HTML) instead of 502s.

Written by Cursor Bugbot for commit 724c237. This will update automatically on new commits.

# Conflicts:
#	packages/api/internal/handlers/proxy_grpc.go

@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.
