
Deniz/debug #417

Open

dzorlu wants to merge 62 commits into meta-pytorch:main from fleet-ai:deniz/debug

Conversation


@dzorlu dzorlu commented Mar 4, 2026

Summary

Type of Change

  • Bug fix
  • New feature
  • Breaking change
  • Documentation
  • New environment
  • Refactoring

Alignment Checklist

Before submitting, verify:

  • I have read .claude/docs/PRINCIPLES.md and this PR aligns with our principles
  • I have checked .claude/docs/INVARIANTS.md and no invariants are violated
  • I have run /pre-submit-pr (or bash .claude/hooks/lint.sh and tests) and addressed all issues

RFC Status

  • Not required (bug fix, docs, minor refactoring)
  • RFC exists: #___
  • RFC needed (will create before merge)

Test Plan

Claude Code Review

Deniz and others added 30 commits December 12, 2025 12:14
- FleetTaskEnv wraps FleetEnvClient with task-oriented interface
- Accepts task configs from export_training_tasks.py
- Creates versioned environments on reset
- Injects task prompt into observations
- Executes verifier for reward computation on episode completion
- Supports both sync and async step methods
- Factory functions: make_fleet_task_env, from_json_file
- Tests: 20 unit tests for init, specs, verifiers, factories

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
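
A rough usage sketch of the interface this commit introduces; the import path, argument names, and return shapes below are assumptions drawn from the commit message, not the actual signatures:

```python
# Hypothetical usage sketch for FleetTaskEnv; all names are assumptions
# based on the commit message, not the real API.
from envs.fleet_env.task_env import make_fleet_task_env  # assumed import path

task_config = {"task_key": "example", "prompt": "..."}  # illustrative placeholder
action = {"tool": "noop"}                               # illustrative placeholder

env = make_fleet_task_env(task_config)       # task config from export_training_tasks.py
obs = env.reset()                            # creates a versioned env, injects the task prompt
obs, reward, done, info = env.step(action)   # sync variant; an async step method also exists
env.close()                                  # verifier computes the reward at episode completion
```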
The MCP images don't exist for all environment versions, causing
FleetVersionNotFoundError when trying to create environments.
Changing the default to None allows the Fleet SDK to use standard
images which are available for all versions.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
FleetEnvClient.from_fleet() was not accepting data_key/data_version
parameters, causing them to be passed through **kwargs to HTTPEnvClient
which doesn't accept them.

- Add data_key and data_version as explicit parameters
- Pass them to fleet.make()
- Update task_env.py to pass them separately

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Fleet SDK expects data_key in "key:version" format, not as separate
parameters. Updated from_fleet() to combine them before calling
fleet.make().

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
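
The combination itself is a one-liner; a minimal sketch of the idea (the function name is illustrative):

```python
# Sketch: the Fleet SDK expects a single "key:version" string, so
# from_fleet() joins the two parameters before calling fleet.make().
def combine_data_key(data_key: str, data_version: str) -> str:
    return f"{data_key}:{data_version}"

assert combine_data_key("amazon_tasks", "v1") == "amazon_tasks:v1"
```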
HTTPEnvClient.reset() doesn't support seed parameter yet.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Increases default timeout from 15s to 60s for Fleet API calls.
This prevents timeouts during environment initialization.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Previously reset() did partial work and reset_async() added tool fetching.
Now reset_async() does all the work (including fetching tools) and reset()
is just a sync wrapper that calls it via run_until_complete().

This ensures both methods return identical results including tools.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
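
A minimal sketch of the wrapper pattern this commit describes (implementation details are assumed; a later commit switches to asyncio.run() for Python 3.13 compatibility):

```python
import asyncio

class FleetTaskEnv:
    async def reset_async(self):
        """Does all the work, including fetching tools."""
        ...

    def reset(self):
        # Sync wrapper: delegate to reset_async() so both entry points
        # return identical results, including tools.
        loop = asyncio.new_event_loop()
        try:
            return loop.run_until_complete(self.reset_async())
        finally:
            loop.close()
```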
MCP's call_tool() returns a CallToolResult Pydantic object, not plain text.
This was causing ugly repr strings to be passed to agents like:
  "meta=None content=[TextContent(type='text', text='...')] ..."

Now properly extracts:
- Text content from result.content[].text
- Tries JSON parsing for structured results
- Falls back to structuredContent if available
- Handles isError cases

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
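
A condensed sketch of the extraction order described above; the field names (content, structuredContent, isError) follow the MCP CallToolResult model named in the commit, but the exact logic here is illustrative:

```python
import json

def extract_tool_result(result):
    # Gather plain text from result.content[].text entries.
    texts = [c.text for c in result.content if getattr(c, "text", None)]
    if getattr(result, "isError", False):
        return {"error": " ".join(texts) or "tool call failed"}
    if texts:
        joined = "\n".join(texts)
        try:
            return json.loads(joined)  # prefer structured JSON when the text parses
        except json.JSONDecodeError:
            return joined
    # Fall back to structuredContent, then to an empty result.
    return getattr(result, "structuredContent", None) or ""
```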
Tests for:
- FleetMCPClient._extract_tool_result():
  - Single text content extraction
  - JSON parsing from text
  - Multiple text contents
  - Error result handling
  - Structured content fallback
  - Empty result handling

- FleetTaskEnv reset:
  - reset_async() returns tools
  - reset() calls reset_async() (sync wrapper)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Move fleet.make() and list_tools() into FleetTaskEnv.__init__()
- Tools are now fetched at env creation, not during reset
- reset_async() calls _orch.reset() with error handling, returns cached tools
- Use asyncio.run() for Python 3.13 compatibility
- Update tests for new initialization pattern

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Log task_key and verifier code preview when verifier fails
- Catch syntax errors separately with clear message
- Show which functions were found if 'verify' is missing

Helps debug issues like "Verifier code must define a 'verify' function"

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Replace custom _execute_verifier_local() with Fleet SDK's Task.verify_detailed()
which properly sets up the verifier namespace with:
- Environment type annotation
- Helper functions (normalized_contains, etc.)
- Proper function discovery (not just "verify" function)

This fixes "name 'Environment' is not defined" errors during verifier execution.

Changes:
- _compute_reward: Create Fleet SDK Task and call verify_detailed()
- Support both 'verifier_code' and 'verifier_func' field names
- Add comprehensive logging for debugging
- Remove broken _execute_verifier_local method

Tests:
- Update all verifier tests to mock Fleet SDK Task.verify_detailed()
- Add tests for various edge cases (no verifier, no orch, exceptions)
- Fix fixture to avoid asyncio.run() conflicts with pytest-asyncio

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add retry with exponential backoff (3 attempts, 1s/2s/4s delays)
- Log errors instead of silently swallowing exceptions
- Log warning when some clients fail but others succeed
- Log error after all retries exhausted

This fixes silent failures when MCP connections are flaky, which caused
'no tools found' errors in SkyRL training.

call_tool now retries with exponential backoff (3 attempts, 1s/2s/4s)
on connection errors, similar to list_tools.

ValueError (tool not found) is not retried.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
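
A self-contained sketch of the retry policy both commits describe: three attempts with exponentially growing delays, and permanent errors (ValueError, tool not found) raised immediately:

```python
import asyncio
import logging

logger = logging.getLogger(__name__)

async def call_with_retries(fn, *args, attempts: int = 3):
    for attempt in range(attempts):
        try:
            return await fn(*args)
        except ValueError:
            raise  # tool not found: permanent, never retried
        except Exception as exc:
            if attempt == attempts - 1:
                logger.error("All %d attempts failed: %s", attempts, exc)
                raise
            delay = 2 ** attempt  # exponential: 1s, 2s, 4s, ...
            logger.warning("Attempt %d failed (%s); retrying in %ds",
                           attempt + 1, exc, delay)
            await asyncio.sleep(delay)
```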
Adds exponential backoff retry (3 attempts, 2s base delay) around
fleet.make() to handle transient Fleet API errors like health check
failures that can occur during instance provisioning.

Only retries on transient errors (health check, timeout, connection).
Permanent errors are raised immediately.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add Toolathlon-style context management tools for long trajectories:
- check_context: Check visible/total turn counts
- manage_context: Drop old turns to free up context space
- search_history: Search all history (including dropped)
- search_tool_output: Search truncated tool output
- view_tool_output: Paginate through truncated output

The ContextManager class can be used by any training framework that
maintains chat_history. It tracks full history and handles truncated
tool outputs.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
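
A minimal sketch of the idea behind ContextManager; method and field names mirror the tool names above but are otherwise illustrative:

```python
class ContextManager:
    def __init__(self, chat_history: list):
        self.full_history = chat_history  # everything, including dropped turns
        self.dropped = 0                  # number of leading turns hidden from context

    def check_context(self) -> dict:
        return {"visible_turns": len(self.full_history) - self.dropped,
                "total_turns": len(self.full_history)}

    def manage_context(self, drop: int) -> None:
        # Drop old turns from the visible window to free context space;
        # they stay in full_history and remain searchable.
        self.dropped = min(self.dropped + drop, len(self.full_history))

    def search_history(self, query: str) -> list:
        # Search all history, including dropped turns.
        return [t for t in self.full_history if query in str(t.get("content", ""))]
```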
Computer-use tasks require MCP-enabled container images (e.g., famazon:mcp0.0.7)
which have scrot installed for screenshots and the MCP server with a 'computer'
tool for mouse/keyboard control.

Previously, tools were only fetched for the tool_use modality due to a
restrictive condition. This caused computer_use tasks to fail with
"no tools found in observation" because the computer tool (mouse,
keyboard, screenshot) was never fetched.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
When task_modality is computer_use, filter tools to only include
the 'computer' tool. This prevents the model from using API tools
when it should be using mouse/keyboard control.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
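
The filter reduces to a few lines; a sketch (tool objects are assumed to expose a .name attribute, as in MCP tool listings):

```python
def filter_tools(tools: list, task_modality: str) -> list:
    # computer_use tasks see only the 'computer' tool; everything else
    # keeps the full tool list.
    if task_modality == "computer_use":
        return [t for t in tools if t.name == "computer"]
    return tools
```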
Deniz and others added 26 commits February 15, 2026 16:42
The manager API (POST /reset) hangs indefinitely on some env images
(e.g. google-maps v0.0.53). Since reset failure is already handled
gracefully (warning + continue), this adds a short dedicated timeout
(default 10s) so the reset fails fast instead of blocking for the
full request_timeout_s (60-120s).

This saves 50-110s per episode during training when the manager API
is unresponsive, while still allowing reset to succeed on healthy envs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
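
A sketch of the fail-fast reset, assuming an httpx-style client where a per-request timeout overrides the client default (names are illustrative):

```python
import httpx

def reset_env(client: httpx.Client, base_url: str, reset_timeout_s: float = 10.0):
    try:
        # Short dedicated timeout: fail in ~10s instead of blocking for
        # the full request_timeout_s (60-120s).
        return client.post(f"{base_url}/reset", timeout=reset_timeout_s)
    except httpx.TimeoutException:
        # Reset failure is already handled gracefully: warn and continue.
        return None
```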
- Add env_key property to FleetTaskEnv
- Prefix all error/warning logs with [env=X] for easy filtering
- Helps identify which environments have infrastructure issues (502s, health checks)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Defense-in-depth against zombie threads: if asyncio cancellation somehow
fails to propagate, HTTP-level timeouts ensure MCP calls fail within
2 minutes instead of hanging forever.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- mcp_tools.py: raise RuntimeError after 3 failed list_tools attempts
  instead of silently returning empty ListToolsAction
- mcp_tools.py: increase retry_base_delay from 1s to 2s
- task_env.py: don't set _tools_fetched=True on failure so next
  reset_async() can retry tool discovery

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Structured observability for fleet env errors (init failures, tool call
failures, MCP timeouts, verifier errors). Adds telemetry.py wrapper and
15 instrumentation sites across task_env.py, client.py, mcp_tools.py.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…, modality

- Add set_task_context() to establish base attributes for all events
- All telemetry events now inherit env_key, env_version, task_key, modality
- Parse env_key:version in client.py to log separately
- Add fleet_rollout_started and fleet_rollout_completed events
- Default environment changed to "training_rollouts"
- Update README with new schema and example SQL query

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Tracks MCP server errors (returned in tool results) separately from
Python exceptions:
- fleet_tool_call_failed: Python exception during call_tool()
- fleet_mcp_tool_error: MCP server returned {"error": ...}, {"status": "failed"}, or {"isError": true}

This aligns telemetry with WandB tool_error counting which tracks both
exception-based errors and error patterns in tool results.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
… full context

- Move set_task_context() before from_fleet() call in task_env.py
- Remove explicit env_key/env_version from client.py telemetry calls (now from context)
- This ensures fleet_make_failed events include task_key and modality

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Remove unused fleet_error import from task_env.py
- Fix _is_tool_error to check truthy values (avoid {"error": null} false positives)
- Make close() exception-safe with try/finally for cleanup
- Emit fleet_rollout_completed on ALL paths (not just verifier success)
- Remove unused _env_name/_env_version parsing in client.py

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The eval step was hanging for 6+ hours because MCP list_tools/call_tool
were not enforcing timeouts during connection establishment.

Changes:
- Add OPERATION_TIMEOUT_S = 60s hard limit using asyncio.wait_for()
- Reduce internal timeouts (30s connect, 60s SSE read)
- Raise TimeoutError with clear message when operations hang

This prevents a single slow/stuck Fleet env from blocking the entire
training run.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
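
A sketch of the hard-deadline wrapper (asyncio.wait_for is the mechanism named in the commit; the surrounding names are illustrative):

```python
import asyncio

OPERATION_TIMEOUT_S = 60  # hard limit on any single MCP operation

async def with_deadline(operation):
    try:
        return await asyncio.wait_for(operation, timeout=OPERATION_TIMEOUT_S)
    except asyncio.TimeoutError:
        # Fail fast with a clear message instead of hanging the eval step.
        raise TimeoutError(
            f"MCP operation exceeded {OPERATION_TIMEOUT_S}s; "
            "the Fleet env may be slow or stuck"
        )
```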
Fleet.make() is synchronous and can block for ~10 minutes per attempt
when an env has health check failures (e.g., fostgres). With 3 retries
and time.sleep(), this blocks the entire event loop for ~30 minutes,
freezing ALL other async trajectories in the batch.

Changes:
- Add FleetEnvClient.from_fleet_async() using AsyncFleet.make() and
  asyncio.sleep() for retries — yields to event loop while waiting
- Defer provisioning from FleetTaskEnv.__init__() to reset_async() via
  _ensure_provisioned(), so fleet.make() runs in async context
- After async provisioning, get sync env handle via Fleet.instance()
  for close() and verify_detailed() compatibility (fast GET, ~100ms)
- Update tests to match new deferred provisioning pattern

No SkyRL changes needed — it already calls __init__() then reset_async().

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
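
A sketch of the non-blocking provisioning pattern: AsyncFleet.make() plus asyncio.sleep() between retries, so other trajectories keep running while this one waits (the make() call signature is assumed, not verified against the Fleet SDK):

```python
import asyncio

async def provision_with_retries(async_fleet, env_key: str, attempts: int = 3):
    for attempt in range(attempts):
        try:
            # AsyncFleet.make() awaits instead of blocking the event loop.
            return await async_fleet.make(env_key)
        except Exception:
            if attempt == attempts - 1:
                raise
            await asyncio.sleep(2 ** attempt)  # yields to the event loop
```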
Add console-visible logger.warning() alongside logfire events for
fleet_mcp_tool_error and fleet_tool_call_failed. Includes:
- env_key:env_version (e.g., amazon:v0.0.12)
- step N/max_steps (e.g., step 3/50)
- tool_name and error message

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- fleet_env_reset_failed: use fleet_warning (no traceback) instead of
  fleet_exception — reset 502s are expected, one-line warning is enough
- fleet_env_close_failed: silence entirely — "Instance already terminated"
  is expected when TTL expires before cleanup
- fleet_env_created: remove logfire console print — logger.info already
  prints the useful "Fleet instance ready in Xs: {id}" line

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…teps

- Move fleet_rollout_started to fire before provisioning so init failures
  (e.g., fostgres health check) are counted in total_rollouts
- Emit fleet_rollout_completed with failure_reason="init_error" on init
  failure, ensuring completed <= total_rollouts invariant
- Add total_steps (SUM of step_count) and init_errors columns to SQL query
- Clarify verifier_errors = code exceptions, not model failures
- Add test for init failure telemetry path

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…me_s

Tracks per-instance provisioning latency in Logfire to diagnose Fleet
queue serialization (96 concurrent make() calls are processed at roughly
one every 10 seconds, leaving the last instance queued for 10+ minutes).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Route MCP by modality: computer_use → /api/v1/mcp (aggregator, port 8081),
  tool_use → /mcp (per-env server, port 3003). Eliminates partial failure
  where aggregator timeout silently dropped the computer tool.
- Make data_key, data_version, image_type required args on from_fleet/from_fleet_async.
- Emit fleet_rollout_completed on post-provisioning failures (tools_error,
  computer_tool_missing) — closes telemetry gap where rollouts started but
  never completed in Logfire.
- Match harness retry config: 8s initial wait, 8 retries, exponential backoff
  capped at 5s.
- Fatal failures: list_tools fail/empty and missing computer tool now raise
  instead of being swallowed as warnings.
- Update README with sequence chart, endpoint routing table, failure reasons.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
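
The routing rule itself is small; a sketch with the paths and ports taken from the commit message:

```python
def mcp_endpoint(base_url: str, task_modality: str) -> str:
    if task_modality == "computer_use":
        return f"{base_url}/api/v1/mcp"  # aggregator, port 8081
    return f"{base_url}/mcp"             # per-env MCP server, port 3003
```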
call_tool() referenced self.retry_base_delay which was removed when
matching harness retry config. Would crash with AttributeError on first
retry. Now uses min(2**attempt, self.max_backoff) like list_tools().

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
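
The corrected delay formula, extracted as a standalone sketch:

```python
def backoff_delay(attempt: int, max_backoff: float = 5.0) -> float:
    # Same formula list_tools() uses: exponential, capped at max_backoff.
    return min(2 ** attempt, max_backoff)

assert [backoff_delay(i) for i in range(4)] == [1, 2, 4, 5.0]
```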
computer_use: 1800s (30 min) — browser + inference is slow
tool_use: 600s (10 min) — API calls are fast

Can be overridden by passing ttl_seconds explicitly.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Every fleet_rollout_started now gets a matching fleet_rollout_completed.
Previously, rollouts that were stopped by the caller (max_turns, context
overflow, job cancellation) never emitted the completed event, creating
gaps in the Logfire dashboard (e.g., 132 started but only 9 completed).

close() now infers stop_reason from state:
- max_steps: step_count >= max_steps
- tool_error: last tool call failed (likely TTL expiry)
- caller_stopped: steps were taken but caller stopped early
- cancelled: rollout started but no steps taken

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
tool_error is transitory (rollout continues after a failed tool call),
so it's misleading as a stop reason. Simplified to just two:
- max_steps: step_count >= max_steps
- abandoned: everything else (caller stopped, cancelled, ctx overflow)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
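
A sketch of the simplified inference (attribute handling is assumed from the commit messages):

```python
def infer_stop_reason(step_count: int, max_steps: int) -> str:
    if step_count >= max_steps:
        return "max_steps"
    return "abandoned"  # caller stopped, cancelled, or context overflow
```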
Documents the rollout lifecycle accounting:
started = completed + init_err + tools_err + no_computer + max_steps + abandoned

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…expiry

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
meta-cla bot added the "CLA Signed" label (managed by the Meta Open Source bot) on Mar 4, 2026.

greptile-apps bot commented Mar 4, 2026

Greptile Summary

This PR introduces the fleet_env integration — a client-side adapter that connects OpenEnv to Fleet-hosted remote environments via two separate planes: HTTP orchestration (FleetEnvClient) and MCP agent actions (FleetMCPTools). The design correctly follows the RFC 001 split-plane principle and RFC 003's tool-call action model. It also adds FleetTaskEnv, a Gymnasium-compatible wrapper with full telemetry (Logfire), modality-aware tool filtering, and context management utilities.

However, this branch is not merge-ready. The PR title "Deniz/debug" and an explicit "THROWAWAY DEBUG" comment in the code confirm it was pushed for debugging purposes:

  • Critical bug — from_fleet_async is an infinite loop: The entire body of AsyncFleet.make() was replaced with a _tick_forever() coroutine that prints a log every 10 seconds and never returns. Since FleetTaskEnv._ensure_provisioned() calls this method, every reset() call hangs indefinitely. Additionally, async_env is initialized to None and never assigned, so even after removing the debug block the code that follows would raise AttributeError.
  • Three tests have missing required arguments: from_fleet() requires data_key, data_version, and image_type but the three sync tests omit them, causing TypeError before any assertion runs.
  • src/pyproject.toml is missing logfire: telemetry.py imports logfire unconditionally at module level, but the src/ package manifest's fleet extras don't include it, causing an ImportError at import time for consumers of that manifest.

Alignment flags:

  • ALIGNMENT FLAG: FleetTaskEnv exposes reset() and step() sync wrappers that call asyncio.run() internally. If a caller already has a running event loop (common in async training frameworks like SkyRL), these will raise RuntimeError: This event loop is already running. The async variants reset_async() / step_async() exist and should be the primary interface.

    • Principle at stake: Predictable, composable async API (RFC 001 / RFC 002 abstractions).
    • Suggested reviewer: @darktex
  • ALIGNMENT FLAG: The PR explicitly notes it does not implement the container/provider abstraction from RFC 002 (ContainerProvider, local Docker lifecycle). It is a Fleet-specific client with no pluggable backend. This may be intentional as a stepping stone, but the deviation from RFC 002's architecture should be acknowledged with either a follow-up RFC or a note in rfcs/002-env-spec.md.

    • Principle at stake: RFC 002 container/provider abstraction.
    • Suggested reviewer: @darktex

Confidence Score: 0/5

  • Not safe to merge — the async provisioning path is permanently broken by intentional throwaway debug code that the author acknowledged should be removed.
  • The explicit "THROWAWAY DEBUG" infinite loop in from_fleet_async makes the entire FleetTaskEnv reset path non-functional. Combined with failing tests (missing required args) and a missing logfire dependency, this branch cannot be used in production in its current state and should not be merged until the debug code is removed and the follow-on bugs are fixed.
  • src/envs/fleet_env/client.py (broken from_fleet_async), tests/envs/test_fleet_env.py (failing test signatures), src/pyproject.toml (missing logfire dependency)

Important Files Changed

| Filename | Overview |
| --- | --- |
| src/envs/fleet_env/client.py | Contains a critical throwaway debug infinite loop in `from_fleet_async` that completely disables environment provisioning; `async_env` is also never assigned, so even after removing the debug block the function would crash. |
| tests/envs/test_fleet_env.py | Three sync tests call `FleetEnvClient.from_fleet` without the required `data_key`, `data_version`, and `image_type` positional arguments — all three will fail with `TypeError` before executing any assertions. |
| src/pyproject.toml | Fleet extras are missing `logfire`, which is imported unconditionally at module level in `telemetry.py`; the root `pyproject.toml` correctly includes it but this manifest does not, causing an `ImportError` for consumers of the `src/` package. |
| src/envs/fleet_env/task_env.py | Well-structured Gymnasium-compatible wrapper with solid telemetry, modality-based tool filtering, and graceful degradation for reset/screenshot failures; relies on the broken `from_fleet_async`, so it is non-functional until the debug code is removed. |
| src/envs/fleet_env/mcp_tools.py | Clean retry/backoff logic and tool-owner routing; the unconditional 8-second `initial_wait` on every `list_tools` call (not just the first) will introduce significant latency in multi-episode workloads. |
| src/envs/fleet_env/telemetry.py | Thin Logfire wrapper with a clean context-variable design; `logfire` is imported unconditionally at module level, so the missing dependency in `src/pyproject.toml` will break consumers. |

Sequence Diagram

```mermaid
sequenceDiagram
    participant SkyRL as SkyRL Generator
    participant TaskEnv as FleetTaskEnv
    participant Client as FleetEnvClient (HTTP)
    participant Tools as FleetMCPTools (MCP)
    participant Fleet as Fleet Runtime

    SkyRL->>TaskEnv: reset()
    TaskEnv->>TaskEnv: fleet_rollout_started (telemetry)
    TaskEnv->>Client: from_fleet_async() [BROKEN: infinite debug loop]
    Note over Client: ❌ _tick_forever() runs forever<br/>AsyncFleet.make() never called<br/>async_env stays None

    alt Provisioning succeeds (after fix)
        Client->>Fleet: AsyncFleet.make(env_key, image_type, ...)
        Fleet-->>Client: env handle + URLs
        Client-->>TaskEnv: (orch, tools)
        TaskEnv->>Client: orch.reset() [HTTP, short timeout]
        TaskEnv->>Tools: tools.list_tools() [8s wait + retries]
        Tools->>Fleet: MCP list_tools
        Fleet-->>Tools: tool list
        Tools-->>TaskEnv: ListToolsAction
        TaskEnv->>TaskEnv: filter tools by modality
        TaskEnv-->>SkyRL: obs {prompt, tools, screenshot?}
    end

    loop Agent turn
        SkyRL->>TaskEnv: step_async(action)
        TaskEnv->>Tools: call_tool(name, args)
        Tools->>Fleet: MCP call_tool
        Fleet-->>Tools: result
        Tools-->>TaskEnv: result
        TaskEnv-->>SkyRL: (obs, reward, done, info)
    end

    SkyRL->>TaskEnv: close()
    TaskEnv->>Client: orch.close()
    Client->>Fleet: terminate instance
    TaskEnv->>TaskEnv: fleet_rollout_completed (telemetry)
```

Last reviewed commit: 20bb13e

Comment on lines 37 to +41 of src/pyproject.toml:

```toml
# Fleet runtime integration (optional)
fleet = [
    "mcp>=1.0.0",
    "fleet-python>=0.2.79",
```

logfire missing from src/pyproject.toml fleet extras — runtime import will fail

telemetry.py unconditionally imports logfire at module level:

```python
import logfire
```

The root pyproject.toml correctly includes logfire>=3.0.0 in the fleet extras, but src/pyproject.toml does not — it includes openai>=2.11.0 instead (which is only imported under TYPE_CHECKING so it's optional at runtime). Any consumer installing via the src/ package manifest will get an ImportError: No module named 'logfire' the moment they import anything from envs.fleet_env.

Suggested change:

```toml
# Fleet runtime integration (optional)
fleet = [
    "mcp>=1.0.0",
    "fleet-python>=0.2.79",
    "logfire>=3.0.0",
    "openai>=2.11.0",
]
```


greptile-apps bot commented Mar 4, 2026

Additional Comments (3)

src/envs/fleet_env/mcp_tools.py
initial_wait applies on every list_tools call, not just the first — 8s penalty on every invocation

The initial_wait=8.0 sleep runs unconditionally at the start of list_tools(). Since FleetTaskEnv.reset_async() calls list_tools() at the beginning of every episode, and some callers (e.g., _call_tool_single_attempt when the owner cache is cold) may also call list_tools() on the client directly, each such call incurs a full 8-second delay.

Consider guarding the sleep so it only fires once per FleetMCPTools instance lifetime:

```python
if self.initial_wait > 0 and not self._clients:
    logger.info(f"Waiting {self.initial_wait:.0f}s for MCP services to initialize...")
    await asyncio.sleep(self.initial_wait)
```

Or add a _initialized: bool = False flag and flip it after the first successful list_tools attempt.


src/envs/fleet_env/client.py
Throwaway debug code left in — from_fleet_async is completely broken

The body of the retry loop was replaced with an infinite _tick_forever() coroutine; the actual AsyncFleet.make() call was never restored. The comment says explicitly: "THROWAWAY DEBUG: do NOT call AsyncFleet.make()".

Consequences:

  1. _tick_forever() is an infinite while True loop — from_fleet_async never returns.
  2. FleetTaskEnv._ensure_provisioned() calls from_fleet_async, so every reset() / reset_async() on a FleetTaskEnv hangs forever.
  3. async_env is initialized to None and is never assigned inside the loop. After the debug code is removed, the code immediately after the for block — async_env.urls.root and await async_env.close() — would raise AttributeError: 'NoneType' object has no attribute 'urls'.

The debug block must be removed and replaced with the actual async_env = await async_fleet.make(...) call before this branch can be merged.


tests/envs/test_fleet_env.py
Missing required positional arguments — tests will raise TypeError

FleetEnvClient.from_fleet declares three non-optional parameters without defaults: data_key: str, data_version: str, and image_type: str. The three sync tests in this block all call from_fleet with only two arguments, so Python will raise TypeError: from_fleet() missing 3 required positional arguments before any assertion runs.

To fix, either add default values to those parameters in the production code, or pass the missing arguments explicitly in each test call.

