Conversation
- FleetTaskEnv wraps FleetEnvClient with task-oriented interface
- Accepts task configs from export_training_tasks.py
- Creates versioned environments on reset
- Injects task prompt into observations
- Executes verifier for reward computation on episode completion
- Supports both sync and async step methods
- Factory functions: make_fleet_task_env, from_json_file
- Tests: 20 unit tests for init, specs, verifiers, factories

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The MCP images don't exist for all environment versions, causing FleetVersionNotFoundError when trying to create environments. Changing the default to None allows the Fleet SDK to use standard images, which are available for all versions.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
FleetEnvClient.from_fleet() was not accepting data_key/data_version parameters, causing them to be passed through **kwargs to HTTPEnvClient, which doesn't accept them.

- Add data_key and data_version as explicit parameters
- Pass them to fleet.make()
- Update task_env.py to pass them separately

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Fleet SDK expects data_key in "key:version" format, not as separate parameters. Updated from_fleet() to combine them before calling fleet.make().

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
HTTPEnvClient.reset() doesn't support a seed parameter yet.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Increases default timeout from 15s to 60s for Fleet API calls. This prevents timeouts during environment initialization.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Previously reset() did partial work and reset_async() added tool fetching. Now reset_async() does all the work (including fetching tools) and reset() is just a sync wrapper that calls it via run_until_complete(). This ensures both methods return identical results, including tools.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
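A minimal sketch of the async-first pattern (the class and field names are illustrative, not the actual implementation; asyncio.run() is used here in place of a manual run_until_complete() loop):

```python
import asyncio

class TaskEnvSketch:
    """Hypothetical stand-in for the env class; names are illustrative."""

    async def reset_async(self) -> dict:
        # All real work happens here (reset, tool fetching, building obs),
        # so sync and async callers get identical results.
        await asyncio.sleep(0)  # placeholder for real async I/O
        return {"prompt": "task prompt", "tools": ["search", "checkout"]}

    def reset(self) -> dict:
        # Thin sync wrapper that delegates to the async implementation.
        return asyncio.run(self.reset_async())
```

With this shape, env.reset() and await env.reset_async() cannot drift apart, because there is only one implementation.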
MCP's call_tool() returns a CallToolResult Pydantic object, not plain text. This was causing ugly repr strings to be passed to agents like: "meta=None content=[TextContent(type='text', text='...')] ..."

Now properly extracts:
- Text content from result.content[].text
- Tries JSON parsing for structured results
- Falls back to structuredContent if available
- Handles isError cases

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
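The extraction logic can be sketched with stand-in types (TextContent and CallToolResult here are simplified dataclasses, not the real mcp Pydantic models, and extract_tool_result is a hypothetical helper):

```python
import json
from dataclasses import dataclass, field

@dataclass
class TextContent:  # simplified stand-in for mcp.types.TextContent
    type: str
    text: str

@dataclass
class CallToolResult:  # simplified stand-in for mcp.types.CallToolResult
    content: list = field(default_factory=list)
    isError: bool = False

def extract_tool_result(result: CallToolResult):
    """Sketch of the extraction described above (hypothetical helper)."""
    texts = [c.text for c in result.content if getattr(c, "type", None) == "text"]
    joined = "\n".join(texts)
    if result.isError:
        return {"error": joined or "tool call failed"}
    try:
        # Structured results often come back as JSON-encoded text.
        return json.loads(joined)
    except (json.JSONDecodeError, TypeError):
        return joined  # plain-text fallback
```

The agent then sees either parsed JSON or clean text, never the Pydantic repr.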
Tests for:
- FleetMCPClient._extract_tool_result():
  - Single text content extraction
  - JSON parsing from text
  - Multiple text contents
  - Error result handling
  - Structured content fallback
  - Empty result handling
- FleetTaskEnv reset:
  - reset_async() returns tools
  - reset() calls reset_async() (sync wrapper)

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Move fleet.make() and list_tools() into FleetTaskEnv.__init__()
- Tools are now fetched at env creation, not during reset
- reset_async() calls _orch.reset() with error handling, returns cached tools
- Use asyncio.run() for Python 3.13 compatibility
- Update tests for new initialization pattern

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Log task_key and verifier code preview when verifier fails
- Catch syntax errors separately with clear message
- Show which functions were found if 'verify' is missing

Helps debug issues like "Verifier code must define a 'verify' function"

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Replace custom _execute_verifier_local() with Fleet SDK's Task.verify_detailed(), which properly sets up the verifier namespace with:
- Environment type annotation
- Helper functions (normalized_contains, etc.)
- Proper function discovery (not just "verify" function)

This fixes "name 'Environment' is not defined" errors during verifier execution.

Changes:
- _compute_reward: Create Fleet SDK Task and call verify_detailed()
- Support both 'verifier_code' and 'verifier_func' field names
- Add comprehensive logging for debugging
- Remove broken _execute_verifier_local method

Tests:
- Update all verifier tests to mock Fleet SDK Task.verify_detailed()
- Add tests for various edge cases (no verifier, no orch, exceptions)
- Fix fixture to avoid asyncio.run() conflicts with pytest-asyncio

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add retry with exponential backoff (3 attempts, 1s/2s/4s delays)
- Log errors instead of silently swallowing exceptions
- Log warning when some clients fail but others succeed
- Log error after all retries exhausted

This fixes silent failures when MCP connections are flaky, which caused 'no tools found' errors in SkyRL training.
call_tool now retries with exponential backoff (3 attempts, 1s/2s/4s) on connection errors, similar to list_tools. ValueError (tool not found) is not retried.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Adds exponential backoff retry (3 attempts, 2s base delay) around fleet.make() to handle transient Fleet API errors like health check failures that can occur during instance provisioning. Only retries on transient errors (health check, timeout, connection). Permanent errors are raised immediately.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
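The retry policy can be sketched as follows (make_with_retry and TRANSIENT_MARKERS are hypothetical names; the real code wraps fleet.make() directly and classifies Fleet SDK exceptions rather than matching message substrings):

```python
import time

# Substrings that mark an error as transient (illustrative heuristic).
TRANSIENT_MARKERS = ("health check", "timeout", "connection")

def make_with_retry(make_fn, attempts: int = 3, base_delay: float = 2.0):
    """Retry a provisioning call with exponential backoff on transient errors."""
    for attempt in range(attempts):
        try:
            return make_fn()
        except Exception as exc:
            transient = any(m in str(exc).lower() for m in TRANSIENT_MARKERS)
            if not transient or attempt == attempts - 1:
                raise  # permanent errors (or the final attempt) surface immediately
            time.sleep(base_delay * (2 ** attempt))  # 2s, 4s, ...
```

Permanent errors (bad env key, auth failure) fail fast on the first attempt; only transient infrastructure errors burn retries.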
Add Toolathlon-style context management tools for long trajectories:
- check_context: Check visible/total turn counts
- manage_context: Drop old turns to free up context space
- search_history: Search all history (including dropped)
- search_tool_output: Search truncated tool output
- view_tool_output: Paginate through truncated output

The ContextManager class can be used by any training framework that maintains chat_history. It tracks full history and handles truncated tool outputs.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
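A minimal sketch of the turn-dropping idea (ContextManagerSketch is illustrative; the real ContextManager additionally handles truncated tool outputs and pagination):

```python
class ContextManagerSketch:
    """Tracks full history while exposing a trimmable visible window."""

    def __init__(self):
        self.full_history = []  # everything, including dropped turns
        self.dropped = 0        # count of oldest turns hidden from the window

    def append(self, turn: str):
        self.full_history.append(turn)

    def check_context(self) -> dict:
        return {"visible": len(self.full_history) - self.dropped,
                "total": len(self.full_history)}

    def manage_context(self, drop: int):
        # Drop the oldest visible turns to free context space.
        self.dropped = min(self.dropped + drop, len(self.full_history))

    def visible_history(self) -> list:
        return self.full_history[self.dropped:]

    def search_history(self, needle: str) -> list:
        # Search ALL history, including turns dropped from the window.
        return [t for t in self.full_history if needle in t]
```

Dropped turns leave the model's prompt but remain searchable, which is the property the search_history tool relies on.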
Computer-use tasks require MCP-enabled container images (e.g., famazon:mcp0.0.7) which have scrot installed for screenshots and the MCP server with 'computer' tool for mouse/keyboard control.
Previously, tools were only fetched for tool_use modality due to a restrictive condition. This caused computer_use tasks to fail with "no tools found in observation" because the computer tool (mouse, keyboard, screenshot) was never fetched.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
When task_modality is computer_use, filter tools to only include the 'computer' tool. This prevents the model from using API tools when it should be using mouse/keyboard control.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The manager API (POST /reset) hangs indefinitely on some env images (e.g. google-maps v0.0.53). Since reset failure is already handled gracefully (warning + continue), this adds a short dedicated timeout (default 10s) so the reset fails fast instead of blocking for the full request_timeout_s (60-120s). This saves 50-110s per episode during training when the manager API is unresponsive, while still allowing reset to succeed on healthy envs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add env_key property to FleetTaskEnv
- Prefix all error/warning logs with [env=X] for easy filtering
- Helps identify which environments have infrastructure issues (502s, health checks)

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Defense-in-depth against zombie threads: if asyncio cancellation somehow fails to propagate, HTTP-level timeouts ensure MCP calls fail within 2 minutes instead of hanging forever.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- mcp_tools.py: raise RuntimeError after 3 failed list_tools attempts instead of silently returning empty ListToolsAction
- mcp_tools.py: increase retry_base_delay from 1s to 2s
- task_env.py: don't set _tools_fetched=True on failure so next reset_async() can retry tool discovery

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Structured observability for fleet env errors (init failures, tool call failures, MCP timeouts, verifier errors). Adds telemetry.py wrapper and 15 instrumentation sites across task_env.py, client.py, mcp_tools.py.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…, modality

- Add set_task_context() to establish base attributes for all events
- All telemetry events now inherit env_key, env_version, task_key, modality
- Parse env_key:version in client.py to log separately
- Add fleet_rollout_started and fleet_rollout_completed events
- Default environment changed to "training_rollouts"
- Update README with new schema and example SQL query

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Tracks MCP server errors (returned in tool results) separately from
Python exceptions:
- fleet_tool_call_failed: Python exception during call_tool()
- fleet_mcp_tool_error: MCP server returned {"error": ...}, {"status": "failed"}, or {"isError": true}
This aligns telemetry with WandB tool_error counting which tracks both
exception-based errors and error patterns in tool results.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
… full context

- Move set_task_context() before from_fleet() call in task_env.py
- Remove explicit env_key/env_version from client.py telemetry calls (now from context)
- This ensures fleet_make_failed events include task_key and modality

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Remove unused fleet_error import from task_env.py
- Fix _is_tool_error to check truthy values (avoid {"error": null} false positives)
- Make close() exception-safe with try/finally for cleanup
- Emit fleet_rollout_completed on ALL paths (not just verifier success)
- Remove unused _env_name/_env_version parsing in client.py
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
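The truthy-value fix in the second bullet can be sketched as follows (the error shapes follow the fleet_mcp_tool_error patterns earlier in this log; the helper name and exact checks are simplified):

```python
def is_tool_error(result) -> bool:
    """Sketch of the truthy check; the real helper is _is_tool_error."""
    if not isinstance(result, dict):
        return False
    if result.get("isError") is True or result.get("status") == "failed":
        return True
    # Truthiness matters: {"error": None} or {"error": ""} is NOT an error.
    return bool(result.get("error"))
```

Checking `"error" in result` instead of truthiness is what produced the `{"error": null}` false positives the commit fixes.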
The eval step was hanging for 6+ hours because MCP list_tools/call_tool were not enforcing timeouts during connection establishment.

Changes:
- Add OPERATION_TIMEOUT_S = 60s hard limit using asyncio.wait_for()
- Reduce internal timeouts (30s connect, 60s SSE read)
- Raise TimeoutError with clear message when operations hang

This prevents a single slow/stuck Fleet env from blocking the entire training run.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
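The hard-deadline pattern can be sketched as follows (with_deadline is a hypothetical wrapper; the real code applies asyncio.wait_for() inside list_tools/call_tool):

```python
import asyncio

OPERATION_TIMEOUT_S = 60.0  # hard ceiling per MCP operation

async def with_deadline(coro, timeout: float = OPERATION_TIMEOUT_S, op: str = "mcp"):
    """Run an MCP coroutine with a hard deadline so it cannot hang forever."""
    try:
        return await asyncio.wait_for(coro, timeout=timeout)
    except asyncio.TimeoutError:
        # Re-raise with a clear, operation-specific message.
        raise TimeoutError(f"{op} did not complete within {timeout:.0f}s") from None
```

Because asyncio.wait_for() cancels the wrapped coroutine on timeout, a stuck connection attempt is torn down rather than leaked.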
Fleet.make() is synchronous and can block for ~10 minutes per attempt when an env has health check failures (e.g., fostgres). With 3 retries and time.sleep(), this blocks the entire event loop for ~30 minutes, freezing ALL other async trajectories in the batch.

Changes:
- Add FleetEnvClient.from_fleet_async() using AsyncFleet.make() and asyncio.sleep() for retries — yields to event loop while waiting
- Defer provisioning from FleetTaskEnv.__init__() to reset_async() via _ensure_provisioned(), so fleet.make() runs in async context
- After async provisioning, get sync env handle via Fleet.instance() for close() and verify_detailed() compatibility (fast GET, ~100ms)
- Update tests to match new deferred provisioning pattern

No SkyRL changes needed — it already calls __init__() then reset_async().

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add console-visible logger.warning() alongside logfire events for fleet_mcp_tool_error and fleet_tool_call_failed.

Includes:
- env_key:env_version (e.g., amazon:v0.0.12)
- step N/max_steps (e.g., step 3/50)
- tool_name and error message

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- fleet_env_reset_failed: use fleet_warning (no traceback) instead of
fleet_exception — reset 502s are expected, one-line warning is enough
- fleet_env_close_failed: silence entirely — "Instance already terminated"
is expected when TTL expires before cleanup
- fleet_env_created: remove logfire console print — logger.info already
prints the useful "Fleet instance ready in Xs: {id}" line
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…teps

- Move fleet_rollout_started to fire before provisioning so init failures (e.g., fostgres health check) are counted in total_rollouts
- Emit fleet_rollout_completed with failure_reason="init_error" on init failure, ensuring completed <= total_rollouts invariant
- Add total_steps (SUM of step_count) and init_errors columns to SQL query
- Clarify verifier_errors = code exceptions, not model failures
- Add test for init failure telemetry path

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…me_s

Tracks per-instance provisioning latency in Logfire to diagnose Fleet queue serialization (96 concurrent make() calls get processed at ~1/10s, causing 10+ min queue delays for the last instance).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Route MCP by modality: computer_use → /api/v1/mcp (aggregator, port 8081), tool_use → /mcp (per-env server, port 3003). Eliminates partial failure where aggregator timeout silently dropped the computer tool.
- Make data_key, data_version, image_type required args on from_fleet/from_fleet_async.
- Emit fleet_rollout_completed on post-provisioning failures (tools_error, computer_tool_missing) — closes telemetry gap where rollouts started but never completed in Logfire.
- Match harness retry config: 8s initial wait, 8 retries, exponential backoff capped at 5s.
- Fatal failures: list_tools fail/empty and missing computer tool now raise instead of being swallowed as warnings.
- Update README with sequence chart, endpoint routing table, failure reasons.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
call_tool() referenced self.retry_base_delay which was removed when matching harness retry config. Would crash with AttributeError on first retry. Now uses min(2**attempt, self.max_backoff) like list_tools().

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
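The capped delay computation, as a small sketch (backoff_delay is a hypothetical free function; in the real code this expression lives inline in the retry loop):

```python
def backoff_delay(attempt: int, max_backoff: float = 5.0) -> float:
    """Exponential backoff capped at max_backoff: 1, 2, 4, 5, 5, ..."""
    return min(2 ** attempt, max_backoff)
```

The cap keeps the worst-case wait per attempt bounded at 5s even across 8 retries.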
computer_use: 1800s (30 min) — browser + inference is slow
tool_use: 600s (10 min) — API calls are fast

Can be overridden by passing ttl_seconds explicitly.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Every fleet_rollout_started now gets a matching fleet_rollout_completed. Previously, rollouts that were stopped by the caller (max_turns, context overflow, job cancellation) never emitted the completed event, creating gaps in the Logfire dashboard (e.g., 132 started but only 9 completed).

close() now infers stop_reason from state:
- max_steps: step_count >= max_steps
- tool_error: last tool call failed (likely TTL expiry)
- caller_stopped: steps were taken but caller stopped early
- cancelled: rollout started but no steps taken

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
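The close-time inference, as a sketch (the function signature is hypothetical; the real code reads these values off the env instance):

```python
def infer_stop_reason(step_count: int, max_steps: int, last_tool_failed: bool) -> str:
    """Infer why a rollout ended from state available at close()."""
    if step_count >= max_steps:
        return "max_steps"
    if last_tool_failed:
        return "tool_error"      # last call failed, likely TTL expiry
    if step_count > 0:
        return "caller_stopped"  # max_turns, context overflow, etc.
    return "cancelled"           # started but no steps taken
```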
tool_error is transitory (rollout continues after a failed tool call), so it's misleading as a stop reason. Simplified to just two:
- max_steps: step_count >= max_steps
- abandoned: everything else (caller stopped, cancelled, ctx overflow)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Documents the rollout lifecycle accounting:

started = completed + init_err + tools_err + no_computer + max_steps + abandoned

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…expiry

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Greptile Summary

This PR introduces the … However, this branch is not merge-ready. The PR title "Deniz/debug" and an explicit "THROWAWAY DEBUG" comment in the code confirm it was pushed for debugging purposes:
Alignment flags:
Confidence Score: 0/5
Important Files Changed
Sequence Diagram

```mermaid
sequenceDiagram
    participant SkyRL as SkyRL Generator
    participant TaskEnv as FleetTaskEnv
    participant Client as FleetEnvClient (HTTP)
    participant Tools as FleetMCPTools (MCP)
    participant Fleet as Fleet Runtime
    SkyRL->>TaskEnv: reset()
    TaskEnv->>TaskEnv: fleet_rollout_started (telemetry)
    TaskEnv->>Client: from_fleet_async() [BROKEN: infinite debug loop]
    Note over Client: ❌ _tick_forever() runs forever<br/>AsyncFleet.make() never called<br/>async_env stays None
    alt Provisioning succeeds (after fix)
        Client->>Fleet: AsyncFleet.make(env_key, image_type, ...)
        Fleet-->>Client: env handle + URLs
        Client-->>TaskEnv: (orch, tools)
        TaskEnv->>Client: orch.reset() [HTTP, short timeout]
        TaskEnv->>Tools: tools.list_tools() [8s wait + retries]
        Tools->>Fleet: MCP list_tools
        Fleet-->>Tools: tool list
        Tools-->>TaskEnv: ListToolsAction
        TaskEnv->>TaskEnv: filter tools by modality
        TaskEnv-->>SkyRL: obs {prompt, tools, screenshot?}
    end
    loop Agent turn
        SkyRL->>TaskEnv: step_async(action)
        TaskEnv->>Tools: call_tool(name, args)
        Tools->>Fleet: MCP call_tool
        Fleet-->>Tools: result
        Tools-->>TaskEnv: result
        TaskEnv-->>SkyRL: (obs, reward, done, info)
    end
    SkyRL->>TaskEnv: close()
    TaskEnv->>Client: orch.close()
    Client->>Fleet: terminate instance
    TaskEnv->>TaskEnv: fleet_rollout_completed (telemetry)
```

Last reviewed commit: 20bb13e
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/pyproject.toml
Line: 37-41
Comment:
**`logfire` missing from `src/pyproject.toml` fleet extras — runtime import will fail**
`telemetry.py` unconditionally imports `logfire` at module level:
```python
import logfire
```
The root `pyproject.toml` correctly includes `logfire>=3.0.0` in the `fleet` extras, but `src/pyproject.toml` does not — it includes `openai>=2.11.0` instead (which is only imported under `TYPE_CHECKING` so it's optional at runtime). Any consumer installing via the `src/` package manifest will get an `ImportError: No module named 'logfire'` the moment they import anything from `envs.fleet_env`.
```suggestion
# Fleet runtime integration (optional)
fleet = [
"mcp>=1.0.0",
"fleet-python>=0.2.79",
"logfire>=3.0.0",
"openai>=2.11.0",
]
```
How can I resolve this? If you propose a fix, please make it concise.
Additional Comments (3)
Prompt To Fix With AI

This is a comment left during a code review.
Path: src/envs/fleet_env/mcp_tools.py
Line: 1577-1589
Comment:
**`initial_wait` applies on every `list_tools` call, not just the first — 8s penalty on every invocation**
The `initial_wait=8.0` sleep runs unconditionally at the start of `list_tools()`. Since `FleetTaskEnv.reset_async()` calls `list_tools()` at the beginning of every episode, and some callers (e.g., `_call_tool_single_attempt` when the owner cache is cold) may also call `list_tools()` on the client directly, each such call incurs a full 8-second delay.
Consider guarding the sleep so it only fires once per `FleetMCPTools` instance lifetime:
```python
if self.initial_wait > 0 and not self._clients:
logger.info(f"Waiting {self.initial_wait:.0f}s for MCP services to initialize...")
await asyncio.sleep(self.initial_wait)
```
Or add a `_initialized: bool = False` flag and flip it after the first successful `list_tools` attempt.
Prompt To Fix With AI

This is a comment left during a code review.
Path: src/envs/fleet_env/client.py
Line: 830-844
Comment:
**Throwaway debug code left in — `from_fleet_async` is completely broken**
The body of the retry loop was replaced with an infinite `_tick_forever()` coroutine; the actual `AsyncFleet.make()` call was never restored. The comment says explicitly: *"THROWAWAY DEBUG: do NOT call AsyncFleet.make()"*.
Consequences:
1. `_tick_forever()` is an infinite `while True` loop — `from_fleet_async` **never returns**.
2. `FleetTaskEnv._ensure_provisioned()` calls `from_fleet_async`, so every `reset()` / `reset_async()` on a `FleetTaskEnv` hangs forever.
3. `async_env` is initialized to `None` and is never assigned inside the loop. After the debug code is removed, the code immediately after the `for` block — `async_env.urls.root` and `await async_env.close()` — would raise `AttributeError: 'NoneType' object has no attribute 'urls'`.
The debug block must be removed and replaced with the actual `async_env = await async_fleet.make(...)` call before this branch can be merged.
Prompt To Fix With AI

This is a comment left during a code review.
Path: tests/envs/test_fleet_env.py
Line: 2731-2762
Comment:
**Missing required positional arguments — tests will raise `TypeError`**
`FleetEnvClient.from_fleet` declares three non-optional parameters without defaults: `data_key: str`, `data_version: str`, and `image_type: str`. The three sync tests in this block all call `from_fleet` with only two arguments, so Python will raise `TypeError: from_fleet() missing 3 required positional arguments` before any assertion runs.
To fix, either add default values to those parameters in the production code, or pass the missing arguments explicitly in each test call.
Summary
Type of Change
Alignment Checklist
Before submitting, verify:
- .claude/docs/PRINCIPLES.md and this PR aligns with our principles
- .claude/docs/INVARIANTS.md and no invariants are violated
- /pre-submit-pr (or bash .claude/hooks/lint.sh and tests) and addressed all issues

RFC Status
Test Plan
Claude Code Review