Conversation
- FleetTaskEnv wraps FleetEnvClient with task-oriented interface
- Accepts task configs from export_training_tasks.py
- Creates versioned environments on reset
- Injects task prompt into observations
- Executes verifier for reward computation on episode completion
- Supports both sync and async step methods
- Factory functions: make_fleet_task_env, from_json_file
- Tests: 20 unit tests for init, specs, verifiers, factories

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The MCP images don't exist for all environment versions, causing FleetVersionNotFoundError when trying to create environments. Changing the default to None allows the Fleet SDK to use standard images, which are available for all versions.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
FleetEnvClient.from_fleet() was not accepting data_key/data_version parameters, causing them to be passed through **kwargs to HTTPEnvClient, which doesn't accept them.

- Add data_key and data_version as explicit parameters
- Pass them to fleet.make()
- Update task_env.py to pass them separately

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Fleet SDK expects data_key in "key:version" format, not as separate parameters. Updated from_fleet() to combine them before calling fleet.make().

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
HTTPEnvClient.reset() doesn't support a seed parameter yet.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Increases default timeout from 15s to 60s for Fleet API calls. This prevents timeouts during environment initialization.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Previously reset() did partial work and reset_async() added tool fetching. Now reset_async() does all the work (including fetching tools) and reset() is just a sync wrapper that calls it via run_until_complete(). This ensures both methods return identical results, including tools.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
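A minimal sketch of the async-first pattern (the class and field names are illustrative, not the actual implementation; asyncio.run() is used here in place of a manual run_until_complete() loop):

```python
import asyncio

class TaskEnvSketch:
    """Hypothetical stand-in for the env class; names are illustrative."""

    async def reset_async(self) -> dict:
        # All real work happens here (reset, tool fetching, building obs),
        # so sync and async callers get identical results.
        await asyncio.sleep(0)  # placeholder for real async I/O
        return {"prompt": "task prompt", "tools": ["search", "checkout"]}

    def reset(self) -> dict:
        # Thin sync wrapper that delegates to the async implementation.
        return asyncio.run(self.reset_async())
```

With this shape, env.reset() and await env.reset_async() cannot drift apart, because there is only one implementation.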
MCP's call_tool() returns a CallToolResult Pydantic object, not plain text. This was causing ugly repr strings to be passed to agents like: "meta=None content=[TextContent(type='text', text='...')] ..."

Now properly extracts:
- Text content from result.content[].text
- Tries JSON parsing for structured results
- Falls back to structuredContent if available
- Handles isError cases

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
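The extraction logic can be sketched with stand-in types (TextContent and CallToolResult here are simplified dataclasses, not the real mcp Pydantic models, and extract_tool_result is a hypothetical helper):

```python
import json
from dataclasses import dataclass, field

@dataclass
class TextContent:  # simplified stand-in for mcp.types.TextContent
    type: str
    text: str

@dataclass
class CallToolResult:  # simplified stand-in for mcp.types.CallToolResult
    content: list = field(default_factory=list)
    isError: bool = False

def extract_tool_result(result: CallToolResult):
    """Sketch of the extraction described above (hypothetical helper)."""
    texts = [c.text for c in result.content if getattr(c, "type", None) == "text"]
    joined = "\n".join(texts)
    if result.isError:
        return {"error": joined or "tool call failed"}
    try:
        # Structured results often come back as JSON-encoded text.
        return json.loads(joined)
    except (json.JSONDecodeError, TypeError):
        return joined  # plain-text fallback
```

The agent then sees either parsed JSON or clean text, never the Pydantic repr.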
Tests for:
- FleetMCPClient._extract_tool_result():
  - Single text content extraction
  - JSON parsing from text
  - Multiple text contents
  - Error result handling
  - Structured content fallback
  - Empty result handling
- FleetTaskEnv reset:
  - reset_async() returns tools
  - reset() calls reset_async() (sync wrapper)

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Move fleet.make() and list_tools() into FleetTaskEnv.__init__()
- Tools are now fetched at env creation, not during reset
- reset_async() calls _orch.reset() with error handling, returns cached tools
- Use asyncio.run() for Python 3.13 compatibility
- Update tests for new initialization pattern

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Log task_key and verifier code preview when verifier fails
- Catch syntax errors separately with clear message
- Show which functions were found if 'verify' is missing

Helps debug issues like "Verifier code must define a 'verify' function"

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Replace custom _execute_verifier_local() with Fleet SDK's Task.verify_detailed(), which properly sets up the verifier namespace with:
- Environment type annotation
- Helper functions (normalized_contains, etc.)
- Proper function discovery (not just "verify" function)

This fixes "name 'Environment' is not defined" errors during verifier execution.

Changes:
- _compute_reward: Create Fleet SDK Task and call verify_detailed()
- Support both 'verifier_code' and 'verifier_func' field names
- Add comprehensive logging for debugging
- Remove broken _execute_verifier_local method

Tests:
- Update all verifier tests to mock Fleet SDK Task.verify_detailed()
- Add tests for various edge cases (no verifier, no orch, exceptions)
- Fix fixture to avoid asyncio.run() conflicts with pytest-asyncio

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add retry with exponential backoff (3 attempts, 1s/2s/4s delays)
- Log errors instead of silently swallowing exceptions
- Log warning when some clients fail but others succeed
- Log error after all retries exhausted

This fixes silent failures when MCP connections are flaky, which caused 'no tools found' errors in SkyRL training.
call_tool now retries with exponential backoff (3 attempts, 1s/2s/4s) on connection errors, similar to list_tools. ValueError (tool not found) is not retried.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Adds exponential backoff retry (3 attempts, 2s base delay) around fleet.make() to handle transient Fleet API errors like health check failures that can occur during instance provisioning. Only retries on transient errors (health check, timeout, connection). Permanent errors are raised immediately.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
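The retry policy can be sketched as follows (make_with_retry and TRANSIENT_MARKERS are hypothetical names; the real code wraps fleet.make() directly and classifies Fleet SDK exceptions rather than matching message substrings):

```python
import time

# Substrings that mark an error as transient (illustrative heuristic).
TRANSIENT_MARKERS = ("health check", "timeout", "connection")

def make_with_retry(make_fn, attempts: int = 3, base_delay: float = 2.0):
    """Retry a provisioning call with exponential backoff on transient errors."""
    for attempt in range(attempts):
        try:
            return make_fn()
        except Exception as exc:
            transient = any(m in str(exc).lower() for m in TRANSIENT_MARKERS)
            if not transient or attempt == attempts - 1:
                raise  # permanent errors (or the final attempt) surface immediately
            time.sleep(base_delay * (2 ** attempt))  # 2s, 4s, ...
```

Permanent errors (bad env key, auth failure) fail fast on the first attempt; only transient infrastructure errors burn retries.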
Add Toolathlon-style context management tools for long trajectories:
- check_context: Check visible/total turn counts
- manage_context: Drop old turns to free up context space
- search_history: Search all history (including dropped)
- search_tool_output: Search truncated tool output
- view_tool_output: Paginate through truncated output

The ContextManager class can be used by any training framework that maintains chat_history. It tracks full history and handles truncated tool outputs.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
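A minimal sketch of the turn-dropping idea (ContextManagerSketch is illustrative; the real ContextManager additionally handles truncated tool outputs and pagination):

```python
class ContextManagerSketch:
    """Tracks full history while exposing a trimmable visible window."""

    def __init__(self):
        self.full_history = []  # everything, including dropped turns
        self.dropped = 0        # count of oldest turns hidden from the window

    def append(self, turn: str):
        self.full_history.append(turn)

    def check_context(self) -> dict:
        return {"visible": len(self.full_history) - self.dropped,
                "total": len(self.full_history)}

    def manage_context(self, drop: int):
        # Drop the oldest visible turns to free context space.
        self.dropped = min(self.dropped + drop, len(self.full_history))

    def visible_history(self) -> list:
        return self.full_history[self.dropped:]

    def search_history(self, needle: str) -> list:
        # Search ALL history, including turns dropped from the window.
        return [t for t in self.full_history if needle in t]
```

Dropped turns leave the model's prompt but remain searchable, which is the property the search_history tool relies on.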
Computer-use tasks require MCP-enabled container images (e.g., famazon:mcp0.0.7) which have scrot installed for screenshots and the MCP server with 'computer' tool for mouse/keyboard control.
Previously, tools were only fetched for tool_use modality due to a restrictive condition. This caused computer_use tasks to fail with "no tools found in observation" because the computer tool (mouse, keyboard, screenshot) was never fetched.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
When task_modality is computer_use, filter tools to only include the 'computer' tool. This prevents the model from using API tools when it should be using mouse/keyboard control.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The manager API (POST /reset) hangs indefinitely on some env images (e.g. google-maps v0.0.53). Since reset failure is already handled gracefully (warning + continue), this adds a short dedicated timeout (default 10s) so the reset fails fast instead of blocking for the full request_timeout_s (60-120s). This saves 50-110s per episode during training when the manager API is unresponsive, while still allowing reset to succeed on healthy envs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add env_key property to FleetTaskEnv
- Prefix all error/warning logs with [env=X] for easy filtering
- Helps identify which environments have infrastructure issues (502s, health checks)

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Defense-in-depth against zombie threads: if asyncio cancellation somehow fails to propagate, HTTP-level timeouts ensure MCP calls fail within 2 minutes instead of hanging forever.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- mcp_tools.py: raise RuntimeError after 3 failed list_tools attempts instead of silently returning empty ListToolsAction
- mcp_tools.py: increase retry_base_delay from 1s to 2s
- task_env.py: don't set _tools_fetched=True on failure so next reset_async() can retry tool discovery

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Structured observability for fleet env errors (init failures, tool call failures, MCP timeouts, verifier errors). Adds telemetry.py wrapper and 15 instrumentation sites across task_env.py, client.py, mcp_tools.py.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…, modality

- Add set_task_context() to establish base attributes for all events
- All telemetry events now inherit env_key, env_version, task_key, modality
- Parse env_key:version in client.py to log separately
- Add fleet_rollout_started and fleet_rollout_completed events
- Default environment changed to "training_rollouts"
- Update README with new schema and example SQL query

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Tracks MCP server errors (returned in tool results) separately from
Python exceptions:
- fleet_tool_call_failed: Python exception during call_tool()
- fleet_mcp_tool_error: MCP server returned {"error": ...}, {"status": "failed"}, or {"isError": true}
This aligns telemetry with WandB tool_error counting which tracks both
exception-based errors and error patterns in tool results.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
… full context

- Move set_task_context() before from_fleet() call in task_env.py
- Remove explicit env_key/env_version from client.py telemetry calls (now from context)
- This ensures fleet_make_failed events include task_key and modality

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Remove unused fleet_error import from task_env.py
- Fix _is_tool_error to check truthy values (avoid {"error": null} false positives)
- Make close() exception-safe with try/finally for cleanup
- Emit fleet_rollout_completed on ALL paths (not just verifier success)
- Remove unused _env_name/_env_version parsing in client.py
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
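The truthy-value fix in the second bullet can be sketched as follows (the error shapes follow the fleet_mcp_tool_error patterns earlier in this log; the helper name and exact checks are simplified):

```python
def is_tool_error(result) -> bool:
    """Sketch of the truthy check; the real helper is _is_tool_error."""
    if not isinstance(result, dict):
        return False
    if result.get("isError") is True or result.get("status") == "failed":
        return True
    # Truthiness matters: {"error": None} or {"error": ""} is NOT an error.
    return bool(result.get("error"))
```

Checking `"error" in result` instead of truthiness is what produced the `{"error": null}` false positives the commit fixes.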
The eval step was hanging for 6+ hours because MCP list_tools/call_tool were not enforcing timeouts during connection establishment.

Changes:
- Add OPERATION_TIMEOUT_S = 60s hard limit using asyncio.wait_for()
- Reduce internal timeouts (30s connect, 60s SSE read)
- Raise TimeoutError with clear message when operations hang

This prevents a single slow/stuck Fleet env from blocking the entire training run.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
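The hard-deadline pattern can be sketched as follows (with_deadline is a hypothetical wrapper; the real code applies asyncio.wait_for() inside list_tools/call_tool):

```python
import asyncio

OPERATION_TIMEOUT_S = 60.0  # hard ceiling per MCP operation

async def with_deadline(coro, timeout: float = OPERATION_TIMEOUT_S, op: str = "mcp"):
    """Run an MCP coroutine with a hard deadline so it cannot hang forever."""
    try:
        return await asyncio.wait_for(coro, timeout=timeout)
    except asyncio.TimeoutError:
        # Re-raise with a clear, operation-specific message.
        raise TimeoutError(f"{op} did not complete within {timeout:.0f}s") from None
```

Because asyncio.wait_for() cancels the wrapped coroutine on timeout, a stuck connection attempt is torn down rather than leaked.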
Fleet.make() is synchronous and can block for ~10 minutes per attempt when an env has health check failures (e.g., fostgres). With 3 retries and time.sleep(), this blocks the entire event loop for ~30 minutes, freezing ALL other async trajectories in the batch.

Changes:
- Add FleetEnvClient.from_fleet_async() using AsyncFleet.make() and asyncio.sleep() for retries — yields to event loop while waiting
- Defer provisioning from FleetTaskEnv.__init__() to reset_async() via _ensure_provisioned(), so fleet.make() runs in async context
- After async provisioning, get sync env handle via Fleet.instance() for close() and verify_detailed() compatibility (fast GET, ~100ms)
- Update tests to match new deferred provisioning pattern

No SkyRL changes needed — it already calls __init__() then reset_async().

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add console-visible logger.warning() alongside logfire events for fleet_mcp_tool_error and fleet_tool_call_failed.

Includes:
- env_key:env_version (e.g., amazon:v0.0.12)
- step N/max_steps (e.g., step 3/50)
- tool_name and error message

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- fleet_env_reset_failed: use fleet_warning (no traceback) instead of
fleet_exception — reset 502s are expected, one-line warning is enough
- fleet_env_close_failed: silence entirely — "Instance already terminated"
is expected when TTL expires before cleanup
- fleet_env_created: remove logfire console print — logger.info already
prints the useful "Fleet instance ready in Xs: {id}" line
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…teps

- Move fleet_rollout_started to fire before provisioning so init failures (e.g., fostgres health check) are counted in total_rollouts
- Emit fleet_rollout_completed with failure_reason="init_error" on init failure, ensuring completed <= total_rollouts invariant
- Add total_steps (SUM of step_count) and init_errors columns to SQL query
- Clarify verifier_errors = code exceptions, not model failures
- Add test for init failure telemetry path

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…me_s

Tracks per-instance provisioning latency in Logfire to diagnose Fleet queue serialization (96 concurrent make() calls get processed at ~1/10s, causing 10+ min queue delays for the last instance).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Route MCP by modality: computer_use → /api/v1/mcp (aggregator, port 8081), tool_use → /mcp (per-env server, port 3003). Eliminates partial failure where aggregator timeout silently dropped the computer tool.
- Make data_key, data_version, image_type required args on from_fleet/from_fleet_async.
- Emit fleet_rollout_completed on post-provisioning failures (tools_error, computer_tool_missing) — closes telemetry gap where rollouts started but never completed in Logfire.
- Match harness retry config: 8s initial wait, 8 retries, exponential backoff capped at 5s.
- Fatal failures: list_tools fail/empty and missing computer tool now raise instead of being swallowed as warnings.
- Update README with sequence chart, endpoint routing table, failure reasons.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
call_tool() referenced self.retry_base_delay which was removed when matching harness retry config. Would crash with AttributeError on first retry. Now uses min(2**attempt, self.max_backoff) like list_tools().

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
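The capped delay computation, as a small sketch (backoff_delay is a hypothetical free function; in the real code this expression lives inline in the retry loop):

```python
def backoff_delay(attempt: int, max_backoff: float = 5.0) -> float:
    """Exponential backoff capped at max_backoff: 1, 2, 4, 5, 5, ..."""
    return min(2 ** attempt, max_backoff)
```

The cap keeps the worst-case wait per attempt bounded at 5s even across 8 retries.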
computer_use: 1800s (30 min) — browser + inference is slow
tool_use: 600s (10 min) — API calls are fast

Can be overridden by passing ttl_seconds explicitly.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Every fleet_rollout_started now gets a matching fleet_rollout_completed. Previously, rollouts that were stopped by the caller (max_turns, context overflow, job cancellation) never emitted the completed event, creating gaps in the Logfire dashboard (e.g., 132 started but only 9 completed).

close() now infers stop_reason from state:
- max_steps: step_count >= max_steps
- tool_error: last tool call failed (likely TTL expiry)
- caller_stopped: steps were taken but caller stopped early
- cancelled: rollout started but no steps taken

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
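The close-time inference, as a sketch (the function signature is hypothetical; the real code reads these values off the env instance):

```python
def infer_stop_reason(step_count: int, max_steps: int, last_tool_failed: bool) -> str:
    """Infer why a rollout ended from state available at close()."""
    if step_count >= max_steps:
        return "max_steps"
    if last_tool_failed:
        return "tool_error"      # last call failed, likely TTL expiry
    if step_count > 0:
        return "caller_stopped"  # max_turns, context overflow, etc.
    return "cancelled"           # started but no steps taken
```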
tool_error is transitory (rollout continues after a failed tool call), so it's misleading as a stop reason. Simplified to just two:
- max_steps: step_count >= max_steps
- abandoned: everything else (caller stopped, cancelled, ctx overflow)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Documents the rollout lifecycle accounting:

started = completed + init_err + tools_err + no_computer + max_steps + abandoned

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…expiry

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Greptile Summary

This PR introduces the … However, this branch is not merge-ready. The PR title "Deniz/debug" and an explicit "THROWAWAY DEBUG" comment in the code confirm it was pushed for debugging purposes:
Alignment flags:
Confidence Score: 0/5
Important Files Changed
Sequence Diagram

```mermaid
sequenceDiagram
    participant SkyRL as SkyRL Generator
    participant TaskEnv as FleetTaskEnv
    participant Client as FleetEnvClient (HTTP)
    participant Tools as FleetMCPTools (MCP)
    participant Fleet as Fleet Runtime
    SkyRL->>TaskEnv: reset()
    TaskEnv->>TaskEnv: fleet_rollout_started (telemetry)
    TaskEnv->>Client: from_fleet_async() [BROKEN: infinite debug loop]
    Note over Client: ❌ _tick_forever() runs forever<br/>AsyncFleet.make() never called<br/>async_env stays None
    alt Provisioning succeeds (after fix)
        Client->>Fleet: AsyncFleet.make(env_key, image_type, ...)
        Fleet-->>Client: env handle + URLs
        Client-->>TaskEnv: (orch, tools)
        TaskEnv->>Client: orch.reset() [HTTP, short timeout]
        TaskEnv->>Tools: tools.list_tools() [8s wait + retries]
        Tools->>Fleet: MCP list_tools
        Fleet-->>Tools: tool list
        Tools-->>TaskEnv: ListToolsAction
        TaskEnv->>TaskEnv: filter tools by modality
        TaskEnv-->>SkyRL: obs {prompt, tools, screenshot?}
    end
    loop Agent turn
        SkyRL->>TaskEnv: step_async(action)
        TaskEnv->>Tools: call_tool(name, args)
        Tools->>Fleet: MCP call_tool
        Fleet-->>Tools: result
        Tools-->>TaskEnv: result
        TaskEnv-->>SkyRL: (obs, reward, done, info)
    end
    SkyRL->>TaskEnv: close()
    TaskEnv->>Client: orch.close()
    Client->>Fleet: terminate instance
    TaskEnv->>TaskEnv: fleet_rollout_completed (telemetry)
```

Last reviewed commit: 20bb13e
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/pyproject.toml
Line: 37-41
Comment:
**`logfire` missing from `src/pyproject.toml` fleet extras — runtime import will fail**
`telemetry.py` unconditionally imports `logfire` at module level:
```python
import logfire
```
The root `pyproject.toml` correctly includes `logfire>=3.0.0` in the `fleet` extras, but `src/pyproject.toml` does not — it includes `openai>=2.11.0` instead (which is only imported under `TYPE_CHECKING` so it's optional at runtime). Any consumer installing via the `src/` package manifest will get an `ImportError: No module named 'logfire'` the moment they import anything from `envs.fleet_env`.
```suggestion
# Fleet runtime integration (optional)
fleet = [
"mcp>=1.0.0",
"fleet-python>=0.2.79",
"logfire>=3.0.0",
"openai>=2.11.0",
]
```
How can I resolve this? If you propose a fix, please make it concise.
Additional Comments (3)
Prompt To Fix With AI

This is a comment left during a code review.
Path: src/envs/fleet_env/mcp_tools.py
Line: 1577-1589
Comment:
**`initial_wait` applies on every `list_tools` call, not just the first — 8s penalty on every invocation**
The `initial_wait=8.0` sleep runs unconditionally at the start of `list_tools()`. Since `FleetTaskEnv.reset_async()` calls `list_tools()` at the beginning of every episode, and some callers (e.g., `_call_tool_single_attempt` when the owner cache is cold) may also call `list_tools()` on the client directly, each such call incurs a full 8-second delay.
Consider guarding the sleep so it only fires once per `FleetMCPTools` instance lifetime:
```python
if self.initial_wait > 0 and not self._clients:
logger.info(f"Waiting {self.initial_wait:.0f}s for MCP services to initialize...")
await asyncio.sleep(self.initial_wait)
```
Or add a `_initialized: bool = False` flag and flip it after the first successful `list_tools` attempt.
Prompt To Fix With AI

This is a comment left during a code review.
Path: src/envs/fleet_env/client.py
Line: 830-844
Comment:
**Throwaway debug code left in — `from_fleet_async` is completely broken**
The body of the retry loop was replaced with an infinite `_tick_forever()` coroutine; the actual `AsyncFleet.make()` call was never restored. The comment says explicitly: *"THROWAWAY DEBUG: do NOT call AsyncFleet.make()"*.
Consequences:
1. `_tick_forever()` is an infinite `while True` loop — `from_fleet_async` **never returns**.
2. `FleetTaskEnv._ensure_provisioned()` calls `from_fleet_async`, so every `reset()` / `reset_async()` on a `FleetTaskEnv` hangs forever.
3. `async_env` is initialized to `None` and is never assigned inside the loop. After the debug code is removed, the code immediately after the `for` block — `async_env.urls.root` and `await async_env.close()` — would raise `AttributeError: 'NoneType' object has no attribute 'urls'`.
The debug block must be removed and replaced with the actual `async_env = await async_fleet.make(...)` call before this branch can be merged.
Prompt To Fix With AI

This is a comment left during a code review.
Path: tests/envs/test_fleet_env.py
Line: 2731-2762
Comment:
**Missing required positional arguments — tests will raise `TypeError`**
`FleetEnvClient.from_fleet` declares three non-optional parameters without defaults: `data_key: str`, `data_version: str`, and `image_type: str`. The three sync tests in this block all call `from_fleet` with only two arguments, so Python will raise `TypeError: from_fleet() missing 3 required positional arguments` before any assertion runs.
To fix, either add default values to those parameters in the production code, or pass the missing arguments explicitly in each test call.
Summary
Type of Change
Alignment Checklist
Before submitting, verify:
- .claude/docs/PRINCIPLES.md and this PR aligns with our principles
- .claude/docs/INVARIANTS.md and no invariants are violated
- /pre-submit-pr (or bash .claude/hooks/lint.sh and tests) and addressed all issues

RFC Status
Test Plan
Claude Code Review