fix: surface WebRTC disconnection to frontend and add auto-reconnect#642
livepeer-tessa wants to merge 6 commits into `main`.
Conversation
When the WebRTC connection to the cloud is lost (state: disconnected/failed):

1. Surface the error to the frontend via /api/v1/cloud/status
   - New field: webrtc_error with message "Remote inference connection lost"
   - New field: webrtc_reconnecting to indicate reconnection in progress
   - The error also appears in the main "error" field for visibility
2. Auto-reconnect with exponential backoff (1s, 2s, 4s)
   - Up to 3 reconnection attempts
   - Uses the last known parameters for reconnection
   - Falls back to a user-friendly error message if all attempts fail

Addresses #640: video output freeze when the cloud connection fails during plugin installation. Previously, WebRTC would disconnect silently and the video would freeze without user feedback.

Signed-off-by: livepeer-tessa <tessa@livepeer.org>
Signed-off-by: livepeer-robot <robot@livepeer.org>
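The backoff schedule described above (1s, 2s, 4s, capped at three attempts) can be sketched as a one-liner; the function name is hypothetical, not the PR's actual code:

```python
def backoff_delays(max_attempts: int = 3, base: float = 1.0) -> list[float]:
    """Exponential backoff delays: base * 2^i for each attempt (1s, 2s, 4s by default)."""
    return [base * (2**i) for i in range(max_attempts)]


print(backoff_delays())  # [1.0, 2.0, 4.0]
```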
Important: Review skipped. This PR was authored by the user configured for CodeRabbit reviews. CodeRabbit does not review PRs authored by this user. It is recommended to use a dedicated user account to post CodeRabbit review feedback.
📝 Walkthrough
Adds WebRTC reconnection and error-tracking to the cloud connection manager, client-side intentional-disconnect handling, a guarded debug endpoint to force-close WebRTC for tests, schema fields exposing WebRTC error/reconnecting state, and an E2E test exercising disconnect and reconnection with backoff.
Sequence Diagram(s)
sequenceDiagram
participant Frontend as Frontend
participant CloudMgr as CloudConnectionManager
participant WebRTC as CloudWebRTCClient
participant Backoff as ReconnectBackoff
Frontend->>CloudMgr: start stream / send parameters
CloudMgr->>WebRTC: start_webrtc(initial_params)
WebRTC-->>CloudMgr: connected
Frontend->>CloudMgr: poll get_status()
CloudMgr-->>Frontend: status (webrtc connected)
Note over WebRTC,CloudMgr: unexpected disconnect
WebRTC-->>CloudMgr: on_webrtc_connection_lost(reason)
CloudMgr-->>Frontend: status (webrtc_error, webrtc_reconnecting=true)
CloudMgr->>Backoff: schedule _attempt_webrtc_reconnect() (1,2,4s)
Backoff->>WebRTC: reconnect attempt (uses _last_webrtc_params)
WebRTC-->>CloudMgr: reconnect success
CloudMgr-->>Frontend: status (webrtc_error cleared, webrtc_reconnecting=false)
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 3 passed
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/scope/server/cloud_connection.py`:
- Around line 636-637: The code currently only seeds self._last_webrtc_params
from initial_parameters, causing reconnection to use stale params; update the
logic so every time send_parameters() successfully applies new live WebRTC/cloud
parameters it updates self._last_webrtc_params to the most recently applied
parameter set (including prompts, noise, etc.); find the send_parameters()
method and add an assignment to refresh self._last_webrtc_params after
successful application (and ensure any async/thread contexts protect the
assignment if needed).
- Around line 639-641: The reset of the reconnect budget in start_webrtc is
clearing _webrtc_reconnect_attempts and _webrtc_error on every re-entry,
preventing _attempt_webrtc_reconnect from ever exhausting its 3-attempt budget;
remove or guard the lines that set self._webrtc_error = None and
self._webrtc_reconnect_attempts = 0 from unconditional execution in start_webrtc
and instead initialize those fields only when performing a fresh/start
connection (e.g., where a new connection is first initiated), so that
_attempt_webrtc_reconnect can increment and eventually stop after the configured
retries.
- Around line 700-724: The reconnect loop may tear down a WebRTC client that has
already recovered during the backoff; after the await asyncio.sleep(delay) and
before cleaning up _webrtc_client, re-check whether a healthy client is already
present (e.g., inspect self._webrtc_client and/or a connectivity flag on it
and/or compare to self._last_webrtc_params and self.is_connected) and abort the
reconnect sequence if so, skipping the disconnect and start_webrtc call; keep
using the same symbols in this function (_webrtc_client, _last_webrtc_params,
start_webrtc, is_connected, _webrtc_reconnect_attempts) so the check runs right
before the existing "Clean up old client" block.
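Taken together, the three findings above suggest a reconnect loop shaped roughly like the sketch below. All names (`connected`, `last_params`, `start_webrtc`, `error`) are assumptions standing in for the repository's `_webrtc_client`, `_last_webrtc_params`, and friends, not the actual code:

```python
import asyncio


async def attempt_reconnect(client, delays=(1, 2, 4)) -> bool:
    """Retry start_webrtc with backoff; abort if the link recovered during the wait."""
    for delay in delays:
        await asyncio.sleep(delay)
        if client.connected:
            # Recovered on its own during backoff: skip the teardown entirely,
            # addressing the race flagged in the third comment.
            return True
        # Reconnect with the most recently applied parameters (first comment).
        if await client.start_webrtc(client.last_params):
            return True
    # Budget exhausted (second comment): surface a user-facing error.
    client.error = "Please toggle cloud mode to reconnect"
    return False
```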
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 293909d6-7566-4ab0-b50d-6ef484e39b15
📒 Files selected for processing (3)
- src/scope/server/cloud_connection.py
- src/scope/server/cloud_webrtc_client.py
- src/scope/server/schema.py
🚀 fal.ai Preview Deployment
Testing: Connect to this preview deployment by running this on your branch: 🧪 E2E tests will run automatically against this deployment.
✅ E2E Tests passed
Test Artifacts: Check the workflow run for screenshots.
Adds a Playwright test for the reconnection logic:

1. Debug endpoint: POST /api/v1/cloud/debug/disconnect-webrtc
   - Force-closes the WebRTC peer connection to simulate a network disconnect
   - Triggers the same code path as a real disconnect
2. E2E test: webrtc-reconnect.spec.ts
   - Connects to cloud and starts a passthrough stream
   - Triggers a disconnect via the debug endpoint
   - Verifies the error surfaces via /api/v1/cloud/status
   - Waits for auto-reconnection (up to 60s for backoff)
   - Verifies the stream recovers after reconnection

Also includes a schema verification test to ensure the new webrtc_error and webrtc_reconnecting fields are present.

Signed-off-by: livepeer-tessa <tessa@livepeer.org>
Signed-off-by: livepeer-robot <robot@livepeer.org>
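The "verifies the error surfaces" step is a poll-until-seen loop against the status endpoint. A Python sketch of that pattern, with the fetch injected as a callable so it is not tied to any HTTP client (names are illustrative):

```python
import time
from typing import Callable


def wait_for_error_surfaced(
    fetch_status: Callable[[], dict],
    timeout_s: float = 10.0,
    poll_s: float = 1.0,
) -> bool:
    """Poll a status source until webrtc_error or webrtc_reconnecting appears.

    Returns True if the error state was observed, False on timeout, so the
    caller can assert on the result instead of passing silently.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        status = fetch_status()
        if status.get("webrtc_error") or status.get("webrtc_reconnecting"):
            return True
        time.sleep(poll_s)
    return False
```

In the real test, `fetch_status` would be a GET of `/api/v1/cloud/status`; returning a boolean mirrors the review suggestion for `verifyErrorSurfaced`.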
🧹 Nitpick comments (3)
e2e/tests/webrtc-reconnect.spec.ts (2)
228-252: verifyErrorSurfaced silently passes when the error is not detected, which may mask test failures. The function logs a warning at line 251 but doesn't fail the test if webrtc_error or webrtc_reconnecting is never observed. This could lead to false positives if the error-surfacing mechanism is broken but the connection happens to recover before the 10-second polling window. Consider throwing an error or at least returning a boolean that the caller can assert on:
♻️ Proposed fix to surface detection result
```diff
 async function verifyErrorSurfaced(page: Page) {
   console.log("Verifying error is surfaced...");

   // Poll the status endpoint for the error
   const MAX_WAIT_MS = 10000;
   const POLL_MS = 1000;
   const start = Date.now();

   while (Date.now() - start < MAX_WAIT_MS) {
     const response = await page.request.get(`${API_BASE}/api/v1/cloud/status`);
     const status = await response.json();

     if (status.webrtc_error || status.webrtc_reconnecting) {
       console.log(
         `✅ Error surfaced: ${status.webrtc_error || "(reconnecting)"}`
       );
-      return;
+      return true;
     }

     await page.waitForTimeout(POLL_MS);
   }

-  // The connection might have already recovered, which is also acceptable
-  console.log("⚠️ Error may have been transient (already recovered)");
+  // The connection might have already recovered
+  console.log("⚠️ Error may have been transient (connection recovered quickly)");
+  return false;
 }
```

Then at the call site, you can decide whether to assert:
```ts
const errorWasSurfaced = await verifyErrorSurfaced(page);
// If strict validation is needed:
// expect(errorWasSurfaced).toBeTruthy();
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@e2e/tests/webrtc-reconnect.spec.ts` around lines 228 - 252, The verifyErrorSurfaced function currently only logs a warning when no webrtc_error or webrtc_reconnecting is observed, which can hide test failures; update verifyErrorSurfaced(page: Page) to return a boolean (true if status.webrtc_error or status.webrtc_reconnecting was seen) and/or throw an Error when the polling times out without seeing the condition, and update its callers to assert the result (e.g. expect(await verifyErrorSurfaced(page)).toBeTruthy() or handle the thrown error). Keep the existing logs but ensure the function returns false or throws on timeout so tests can fail deterministically; reference the verifyErrorSurfaced function and the usage of status.webrtc_error / status.webrtc_reconnecting and API_BASE in the implementation and call sites.
77-99: Second test validates the schema but doesn't test failure scenarios as named. The test is named "shows error message when reconnection fails" but only validates that the status response includes the expected fields. The comment on line 78 acknowledges this. Consider renaming the test to reflect what it actually tests, or adding a TODO to implement the actual failure scenario.
✏️ Suggested rename for clarity
```diff
-  test("shows error message when reconnection fails", async ({ page }) => {
-    // This test would require mocking the cloud endpoint to fail
-    // For now, we just verify the error state is accessible via API
+  test("status endpoint includes reconnection fields in schema", async ({ page }) => {
+    // TODO: Add a separate test for reconnection failure when cloud mocking is available
     test.setTimeout(60000);
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@e2e/tests/webrtc-reconnect.spec.ts` around lines 77 - 99, The test name test("shows error message when reconnection fails", ...) is misleading because the body only checks the status schema; rename the test to something like "exposes webrtc reconnection status fields" or add a TODO comment above the test indicating the need to mock the cloud endpoint and implement the actual failure scenario; update the test title string in the test(...) call (and optionally add a TODO referencing mocking the cloud endpoint and verifying UI error display) while leaving the existing assertions that reference statusResponse and status intact.

src/scope/server/app.py (1)
2552-2576: Consider guarding this debug endpoint for non-production use. This endpoint can forcibly disconnect WebRTC connections, which could be disruptive if exposed in production. Consider adding a guard such as checking an environment variable or a debug-mode flag.
🛡️ Suggested guard for production environments
```diff
 @app.post("/api/v1/cloud/debug/disconnect-webrtc")
 async def debug_disconnect_webrtc(
     cloud_manager: "CloudConnectionManager" = Depends(get_cloud_connection_manager),
 ):
     """Force-close the WebRTC connection to simulate a disconnect.

     This endpoint is intended for E2E testing of the reconnection logic.
     It triggers the same code path as a real network disconnect.

     Returns the cloud status after triggering the disconnect.
     """
+    # Only allow in development/testing environments
+    if not os.environ.get("SCOPE_DEBUG_ENDPOINTS"):
+        raise HTTPException(
+            status_code=404,
+            detail="Not found",
+        )
+
     if not cloud_manager.webrtc_connected:
         raise HTTPException(
             status_code=400,
             detail="WebRTC not connected",
         )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/scope/server/app.py` around lines 2552 - 2576, The debug_disconnect_webrtc endpoint allows forcing WebRTC disconnects and must be disabled in production; update the debug_disconnect_webrtc handler to check a production-safe guard (e.g., an environment variable like ENABLE_DEBUG_ENDPOINTS or an application debug flag / settings.DEBUG) at the start of the function and raise HTTPException(status_code=403, detail="Debug endpoint disabled") if the guard is not enabled. Use standard libs (os.getenv) or the existing app/config debug flag, log denied attempts, and keep the rest of the logic (CloudConnectionManager, _webrtc_client, pc.close(), and get_status()) unchanged so the endpoint only runs when the guard permits it.
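The guard pattern recommended above boils down to an environment check that hides the endpoint entirely when unset. A minimal framework-free sketch (function names are hypothetical; the real change lives in a FastAPI handler and raises HTTPException instead of returning a tuple):

```python
import os


def debug_endpoints_enabled() -> bool:
    """Gate for test-only endpoints; the variable is unset in production."""
    return bool(os.environ.get("SCOPE_DEBUG_ENDPOINTS"))


def guard_debug_endpoint() -> tuple[int, str]:
    # Return 404 rather than 403 so the endpoint's existence stays hidden
    # from probing in production.
    if not debug_endpoints_enabled():
        return (404, "Not found")
    return (200, "ok")
```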
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: ac40e676-21b6-45f4-9012-920c981d2615
📒 Files selected for processing (2)
- e2e/tests/webrtc-reconnect.spec.ts
- src/scope/server/app.py
The debug endpoint calls pc.close(), which sets the connection state to 'closed', not 'disconnected' or 'failed'. Added intentional-disconnect tracking to distinguish user-initiated closes from unexpected ones.

- Add an _intentional_disconnect flag to CloudWebRTCClient
- Set the flag before calling pc.close() in disconnect()
- Reset the flag when starting a new connection
- Handle the 'closed' state the same as 'disconnected'/'failed' when not intentional

Signed-off-by: livepeer-tessa <tessa@livepeer.org>
Signed-off-by: livepeer-robot <robot@livepeer.org>
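The intentional-disconnect flag described in this commit can be sketched as follows. This is an illustration of the pattern, not the repository's `CloudWebRTCClient` (the real `disconnect()` is async and actually closes the peer connection):

```python
class WebRTCClientSketch:
    """Distinguishes user-initiated closes from unexpected connection loss."""

    def __init__(self):
        self._intentional_disconnect = False

    def disconnect(self):
        # Mark before closing: pc.close() also fires a 'closed' state change,
        # and the handler must not treat that as a failure.
        self._intentional_disconnect = True

    def connection_lost_unexpectedly(self, state: str) -> bool:
        """'closed' counts the same as 'disconnected'/'failed' unless intentional."""
        if state in ("disconnected", "failed", "closed"):
            return not self._intentional_disconnect
        return False
```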
1. Fix stale params on reconnect:
   - Update _last_webrtc_params in send_parameters() after a successful send
   - Merge new params with existing to preserve complete state
2. Fix reconnect budget reset issue:
   - Add _is_reconnecting flag to track reconnection state
   - Only reset _webrtc_error and _webrtc_reconnect_attempts on a fresh start
   - Prevents an infinite retry loop when start_webrtc() is called from reconnect
3. Fix race condition in reconnect loop:
   - Check webrtc_connected after the backoff sleep
   - Abort reconnection if the connection recovered during the wait
   - Also check whether the cloud connection was lost during the sleep
4. Guard debug endpoint for production:
   - Only allow when the SCOPE_DEBUG_ENDPOINTS=1 env var is set
   - Return 404 in production to hide the endpoint's existence
   - Update the CI workflow to set the env var for E2E tests
5. E2E test improvements:
   - Rename misleading test to "status endpoint includes WebRTC reconnection fields"
   - verifyErrorSurfaced() now returns a boolean for proper assertions
   - Add "preserves parameters across reconnection" test
   - Add "surfaces error when reconnection fails multiple times" test

Signed-off-by: livepeer-tessa <tessa@livepeer.org>
Signed-off-by: livepeer-robot <robot@livepeer.org>
Signed-off-by: livepeer-tessa <tessa@livepeer.org> Signed-off-by: livepeer-robot <robot@livepeer.org>
Actionable comments posted: 2
🧹 Nitpick comments (2)
src/scope/server/app.py (1)
2565: Redundant import. `os` is already imported at the module level (line 6). The local import here is unnecessary.
♻️ Remove redundant import
```diff
-    import os
-
     # Guard: Only allow in development/testing environments
     if not os.environ.get("SCOPE_DEBUG_ENDPOINTS"):
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/scope/server/app.py` at line 2565, Redundant local import: remove the unnecessary "import os" added near the bottom of src/scope/server/app.py (the duplicate import referenced in the diff) since "os" is already imported at module level; simply delete that local import statement to avoid duplication and keep the module-level import only.

.github/workflows/docker-build.yml (1)
299-302: Redundant environment variable setting. SCOPE_DEBUG_ENDPOINTS is set both in the env: block (line 299) and as an inline prefix on the command (line 302). The env: block should suffice for the entire step. Consider removing the inline prefix for clarity:
♻️ Suggested simplification
```diff
   env:
     SCOPE_CLOUD_APP_ID: ${{ needs.deploy-pr.outputs.fal_app_id }}
     SCOPE_DEBUG_ENDPOINTS: "1"
   run: |
     echo "Starting Scope with cloud app ID: $SCOPE_CLOUD_APP_ID"
-    SCOPE_DEBUG_ENDPOINTS=1 uv run daydream-scope > scope-app.log 2>&1 &
+    uv run daydream-scope > scope-app.log 2>&1 &
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/workflows/docker-build.yml around lines 299 - 302, The SCOPE_DEBUG_ENDPOINTS environment variable is redundantly set both in the step env block and as an inline prefix on the command; remove the inline prefix "SCOPE_DEBUG_ENDPOINTS=1" from the command that runs uv (the line invoking "uv run daydream-scope > scope-app.log 2>&1 &") so the step uses the env block value for SCOPE_DEBUG_ENDPOINTS and keep the rest of the command intact.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@e2e/tests/webrtc-reconnect.spec.ts`:
- Around line 95-101: The test is calling a non-existent endpoint
`/api/v1/pipeline/parameters`; replace that POST with a call to the existing
pipeline load endpoint so parameters are applied: send testParams via
page.request.post to `${API_BASE}/api/v1/pipeline/load` (using the same
testParams variable) and store the result in updateResponse, then assert the
response status is successful (e.g., 200) instead of only logging it; update the
code references to testParams, updateResponse, and API_BASE accordingly so the
test uses the supported endpoint to apply parameters during the reconnect
scenario.
In `@src/scope/server/cloud_connection.py`:
- Around line 79-86: The file failed Ruff formatting; run an automatic formatter
or apply the same style used project-wide to fix spacing/annotations around the
WebRTC-related attributes (e.g., _webrtc_error, _webrtc_reconnect_task,
_webrtc_reconnect_attempts, _max_webrtc_reconnect_attempts, _last_webrtc_params,
_is_reconnecting) — specifically run `ruff format
src/scope/server/cloud_connection.py` (or your project's formatter) and commit
the resulting changes so the attribute declarations and type hints match the
repository's linting rules.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: bab06267-07ef-4e35-89a5-2a37615e07ac
📒 Files selected for processing (4)
- .github/workflows/docker-build.yml
- e2e/tests/webrtc-reconnect.spec.ts
- src/scope/server/app.py
- src/scope/server/cloud_connection.py
e2e/tests/webrtc-reconnect.spec.ts (outdated)
```ts
const testParams = { prompt: "test-reconnection-prompt-" + Date.now() };
const updateResponse = await page.request.post(
  `${API_BASE}/api/v1/pipeline/parameters`,
  { data: testParams }
);
// Parameter update may succeed or fail depending on pipeline state
console.log(`Parameter update status: ${updateResponse.status()}`);
```
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Search for the pipeline parameters endpoint in the codebase
rg -n "pipeline/parameters" --type py
rg -n "@app\.(post|put|patch).*parameter" --type py src/scope/server/app.py

Repository: daydreamlive/scope
Length of output: 44
🏁 Script executed:
#!/bin/bash
# Find app.py file
find . -name "app.py" -type f
# Look for any parameter-related API endpoints
rg -n "api/v1" --type py | head -20
# Search more broadly for parameter endpoints
rg -n "/parameters" --type py
# Check if the endpoint might be in a different location or pattern
rg -n "pipeline" --type py | grep -i parameter

Repository: daydreamlive/scope
Length of output: 4818
🏁 Script executed:
#!/bin/bash
# Check the size of app.py
wc -l ./src/scope/server/app.py
# Search for all route definitions in app.py
rg -n "@app\.(post|get|put|patch|delete)" ./src/scope/server/app.py

Repository: daydreamlive/scope
Length of output: 2558
🏁 Script executed:
#!/bin/bash
# Read the test file around lines 95-101
sed -n '80,120p' e2e/tests/webrtc-reconnect.spec.ts

Repository: daydreamlive/scope
Length of output: 1507
The /api/v1/pipeline/parameters endpoint does not exist and cannot be used to update parameters.
This test attempts to update pipeline parameters via a POST request to /api/v1/pipeline/parameters (lines 95-97), but this endpoint does not exist in app.py. The only available pipeline endpoints are /api/v1/pipeline/load and /api/v1/pipeline/status. Since the test does not assert on the response (it only logs the status), it silently fails to test parameter preservation as intended. Either implement the missing endpoint or use an existing mechanism to update parameters during this test.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@e2e/tests/webrtc-reconnect.spec.ts` around lines 95 - 101, The test is
calling a non-existent endpoint `/api/v1/pipeline/parameters`; replace that POST
with a call to the existing pipeline load endpoint so parameters are applied:
send testParams via page.request.post to `${API_BASE}/api/v1/pipeline/load`
(using the same testParams variable) and store the result in updateResponse,
then assert the response status is successful (e.g., 200) instead of only
logging it; update the code references to testParams, updateResponse, and
API_BASE accordingly so the test uses the supported endpoint to apply parameters
during the reconnect scenario.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/scope/server/cloud_connection.py`:
- Around line 674-676: stop_webrtc() currently clears _webrtc_error and
_webrtc_reconnect_attempts but leaves _last_webrtc_params set, allowing a
subsequent start_webrtc(None) to inherit stale params; update stop_webrtc() to
also clear the cached reconnect parameters by setting _last_webrtc_params to
None so an explicit shutdown does not leak previous session prompts/noise into
future reconnects (refer to stop_webrtc(), _last_webrtc_params, and
start_webrtc() when applying the change).
- Around line 663-667: stop_webrtc() cancels self._webrtc_reconnect_task but
doesn't await it, so the task's finally block in _attempt_webrtc_reconnect() may
not run and _is_reconnecting can remain True; modify stop_webrtc() to await the
cancelled task after calling self._webrtc_reconnect_task.cancel() (e.g., try:
await self._webrtc_reconnect_task except asyncio.CancelledError: pass) before
setting self._webrtc_reconnect_task = None so the reconnect task's cleanup runs
and _is_reconnecting / retry state are cleared before a restart.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 2fe9c2be-63e6-4ddf-b154-c266279d24ac
📒 Files selected for processing (1)
src/scope/server/cloud_connection.py
1. Fix stop_webrtc() to properly await the cancelled reconnect task:
   - Await the task after cancelling to ensure its finally block runs
   - Clear the _is_reconnecting flag on explicit stop
2. Clear _last_webrtc_params on stop:
   - Prevents stale params leaking into future sessions
   - Start with fresh state on the next connect
3. Fix E2E test using a non-existent endpoint:
   - Removed the call to /api/v1/pipeline/parameters (it doesn't exist)
   - Parameters are sent via the WebRTC data channel, not HTTP
   - Simplified the test to focus on reconnection verification
4. Remove redundant env var in workflow:
   - SCOPE_DEBUG_ENDPOINTS is already set in the env block
   - No need for the inline prefix on the command

Signed-off-by: livepeer-tessa <tessa@livepeer.org>
Signed-off-by: livepeer-robot <robot@livepeer.org>
Summary
When WebRTC connection to cloud is lost during streaming (e.g., network hiccup, plugin installation failure), this PR:
- Surfaces the error to the frontend via /api/v1/cloud/status
- Auto-reconnects with exponential backoff

Addresses #640
Video output freeze when cloud connection fails during plugin installation. Previously, WebRTC would disconnect silently and the video would freeze without user feedback.
Architecture
Reconnection Flow
```mermaid
sequenceDiagram
    participant F as Frontend
    participant B as Backend
    participant C as Cloud (fal.ai)
    Note over B,C: Normal streaming
    B->>C: Video frames
    C->>B: Processed frames
    Note over B,C: Connection lost
    C--xB: WebRTC disconnected
    B->>B: connectionstatechange → 'closed'
    B->>B: Set webrtc_error
    B->>B: Start reconnect task
    loop Exponential Backoff (1s, 2s, 4s)
        B->>B: Sleep (delay)
        alt Connection recovered during sleep
            B->>B: Abort reconnection ✓
        else Cloud connection still valid
            B->>C: start_webrtc(last_params)
            alt Success
                B->>B: Clear error, reset attempts
                B->>C: Resume streaming ✓
            else Failure
                B->>B: Increment attempt counter
            end
        end
    end
    alt All 3 attempts failed
        B->>B: Set final error message
        F->>B: GET /api/v1/cloud/status
        B->>F: {webrtc_error: 'Please toggle cloud mode...'}
    end
```

Parameter Preservation
```mermaid
sequenceDiagram
    participant U as User
    participant B as Backend
    participant C as Cloud
    U->>B: Update prompt
    B->>B: send_parameters({prompt: 'new'})
    B->>B: _last_webrtc_params.update({prompt: 'new'})
    B->>C: Data channel: parameters
    Note over B,C: Connection lost
    C--xB: Disconnect
    B->>B: _attempt_webrtc_reconnect()
    B->>C: start_webrtc(_last_webrtc_params)
    Note right of B: Parameters preserved!
    C->>B: Reconnected with latest params
```

Issues Fixed (CodeRabbit Review)
- send_parameters() now updates _last_webrtc_params
- _is_reconnecting flag guards the reconnect-budget reset
- Reconnect loop re-checks webrtc_connected before proceeding
- Debug endpoint guarded by the SCOPE_DEBUG_ENDPOINTS env var
Status Response (new fields)
```json
{
  "webrtc_connected": false,
  "webrtc_error": "Remote inference connection lost: WebRTC state closed",
  "webrtc_reconnecting": true
}
```

Debug Endpoint
Only available when SCOPE_DEBUG_ENDPOINTS=1. Returns 404 in production.

Files Changed
- cloud_connection.py — reconnection logic and error tracking
- cloud_webrtc_client.py — intentional-disconnect handling
- schema.py — new webrtc_error, webrtc_reconnecting fields
- app.py — debug disconnect endpoint
- docker-build.yml — sets SCOPE_DEBUG_ENDPOINTS=1 for E2E tests
- webrtc-reconnect.spec.ts — E2E tests

E2E Tests
Testing
Summary by CodeRabbit

New Features
- Automatic WebRTC reconnection with exponential backoff; error and reconnection state exposed via the cloud status API

Bug Fixes
- WebRTC disconnections are surfaced to the frontend instead of freezing the video silently

Tests
- E2E test exercising disconnect, error surfacing, and reconnection with backoff

Chores
- CI workflow sets SCOPE_DEBUG_ENDPOINTS=1 for E2E test runs