🤖 bench: enable PTC + hard-restart by default, fix mux-run.sh fatal-on-exit#2239
🤖 bench: enable PTC + hard-restart by default, fix mux-run.sh fatal-on-exit#2239ammar-agent wants to merge 2 commits intomainfrom
Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 0b9aad504b
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| if ! printf '%s' "${instruction}" | "${cmd[@]}" | tee "${MUX_OUTPUT_FILE}"; then | ||
| fatal "mux agent session failed" | ||
| mux_exit_code=$? | ||
| log "WARNING: mux agent session exited with code ${mux_exit_code}" |
There was a problem hiding this comment.
Preserve the failing exit code before
! negates it
Because the pipeline is wrapped in if ! …; then, the ! operator negates the pipeline status, so inside the block $? is always 0 when the command failed. With set -o pipefail, this means mux_exit_code becomes 0 on failures/timeouts, and the script exits 0 even when the agent crashed, so the harness can’t detect failures. Capture the pipeline status before applying ! (or use if ...; then ... else mux_exit_code=$?) to preserve the real exit code.
Useful? React with 👍 / 👎.
…exit - Set enabledByDefault=true for programmatic-tool-calling and exec-subagent-hard-restart experiments - CLI buildExperimentsObject now auto-enables default experiments when no explicit --experiment flags are passed - mux-run.sh: replace fatal() on agent non-zero exit with a warning, allowing token extraction to run unconditionally - Propagate agent exit code at end of script
0b9aad5 to
607e7a2
Compare
Plumbs MUX_EXPLORE_MODEL through mux_agent.py → mux-run.sh → CLI. Sets config.agentAiDefaults.explore so explore sub-agents use a fast/cheap model instead of inheriting the expensive parent model.
607e7a2 to
55d3827
Compare
Summary
Enable Programmatic Tool Calling (PTC) and exec-subagent-hard-restart experiments by default for all users. Fix token extraction loss in Terminal-Bench runs when agent exits non-zero.
Background
The Terminal-Bench baseline (Opus 4.6 / xhigh) scores 57.3% (51/89 pass). Analysis of the failure modes shows:
mux-run.shcallsfatal()before the token-extraction python block runsImplementation
1. PTC enabled by default (
experiments.ts)programmatic-tool-calling:enabledByDefault: truecode_executiontool is now available by default in both the desktop app and CLI, letting the model batch 2+ tool calls in a single turn2. Exec sub-agent hard restart enabled by default (
experiments.ts)exec-subagent-hard-restart:enabledByDefault: true3. CLI respects enabledByDefault (
run.ts)buildExperimentsObject()now auto-enables experiments withenabledByDefault: truewhen no explicit--experimentflags are passed--experimentflags, those override defaults entirely (no mixing)4. Fix mux-run.sh fatal-on-exit (
mux-run.sh)fatal "mux agent session failed"with a warning + continueValidation
make typecheck— passmake lint— passmake fmt-check— passmake test— 3446 pass, 0 failGenerated with
mux• Model:anthropic:claude-opus-4-6• Thinking:xhigh• Cost:$92.04