From bcb82eba15530270198ba3250e0bf556c5370855 Mon Sep 17 00:00:00 2001
From: Ammar
Date: Sun, 8 Feb 2026 12:07:02 -0600
Subject: [PATCH 01/10] bench: add global AGENTS.md for tbench sandbox

Deploy general task execution guidelines as global instructions
(~/.mux/AGENTS.md) in the tbench sandbox. The instructions are
synthesized from analysis of 21 inconsistently-passing tasks across 7
domains, focusing on patterns that differentiate successful from failed
runs.
---
 benchmarks/terminal_bench/AGENTS.md       | 45 +++++++++++++++++++++++
 benchmarks/terminal_bench/mux_agent.py    | 10 +++++
 benchmarks/terminal_bench/mux_setup.sh.j2 |  7 ++++
 3 files changed, 62 insertions(+)
 create mode 100644 benchmarks/terminal_bench/AGENTS.md

diff --git a/benchmarks/terminal_bench/AGENTS.md b/benchmarks/terminal_bench/AGENTS.md
new file mode 100644
index 0000000000..6e396bf8ad
--- /dev/null
+++ b/benchmarks/terminal_bench/AGENTS.md
@@ -0,0 +1,45 @@
+# Task Execution Guidelines
+
+## Workflow Discipline
+
+- **Work in short plan → execute → verify cycles.** After a brief plan (what to do next), execute a tool call, then verify the result before proceeding. Avoid extended reasoning without tool execution — if your thinking exceeds ~20 lines without acting, stop and write a script or run a command instead. Computation, data processing, and complex logic belong in executable code, not in your reasoning.
+
+- **Explore the environment before committing to an approach.** Check available languages, runtimes, package managers, and system utilities (`which`, `command -v`) before writing code that depends on them. Discover constraints early — redesigning after implementation wastes time and budget.
+
+- **Read acceptance criteria before implementing.** Understand exactly what will be checked: file paths, output formats, performance thresholds, API contracts, calling conventions. Verify your solution against the same metrics and methods the evaluator uses, not your own approximation.
+
+## Verification & Correctness
+
+- **Always verify end-to-end before declaring done.** Run your solution in conditions matching how it will be evaluated. Test scripts in a fresh subprocess; check that expected files exist at exact paths. "I reviewed the code and it looks correct" is not verification — execute it.
+
+- **Verify deliverables from the evaluator's perspective.** Your outputs must work without session-specific state (pip packages you installed, environment variables you set, running background processes). Test portability by running delivered scripts via an explicit interpreter path in a clean context.
+
+- **When two approaches give different results, investigate — don't guess.** Construct a minimal test case to determine which is correct. Resolve discrepancies explicitly rather than picking one and hoping.
+
+## Error Recovery & Efficiency
+
+- **Fail fast on polling/retry loops, then diagnose.** Use short initial timeouts (5–10 attempts, not 60). If early attempts fail, stop the loop and investigate the root cause. A 30-second diagnosis beats a 5-minute doomed retry loop.
+
+- **Pivot strategy after 2 failed attempts, not 5.** If an approach fails twice with the same symptom, stop making incremental tweaks and reconsider your fundamental approach. Each failed retry costs time and may leave corrupt state (zombie processes, partial files) that makes future attempts harder.
+
+- **Set strict time budgets for computational experiments.** Use short timeouts (30–120s) for code that might be slow. If a solution doesn't complete quickly, that's a signal to reconsider the algorithm — not to add parallelism, sleep commands, or longer timeouts.
+
+## State Management
+
+- **Preserve working state before iterating.** Once a solution produces correct output, save or back it up before attempting improvements. Never overwrite a validated result with an unvalidated alternative. Don't "clean up" or reinitialize deliverables after successful verification.
+
+- **Treat provided data as read-only.** Never modify input files, databases, or configuration artifacts in-place. If you need to experiment (add indexes, modify config), work on a copy first. Irreversible side effects can silently invalidate your solution.
+
+## Deliverable Quality
+
+- **Deliver self-contained artifacts.** Scripts and outputs must work without your session's state. Prefer standard library solutions with robust fallbacks for optional dependencies. If using an external library, ensure a stdlib fallback exists for environments where it's unavailable.
+
+- **When a task requires a persistent service, ensure it survives your session ending.** Use `nohup > /path/to/log 2>&1 & disown` or a proper process manager — not shell `&` or agent-managed background tasks. Verify the service is accessible from a separate shell invocation before declaring done.
+
+- **Prefer simple, direct implementations when testing is limited.** Complex abstractions increase bug surface, especially when you can't verify each piece incrementally. Choose the simplest correct approach and manually trace through edge cases if automated testing isn't available.
+
+## Multi-Step System Configuration
+
+- **Verify each step individually before proceeding.** When configuring multi-step systems, execute and verify each step with observable output. Don't batch everything into a single opaque script — if step 3 of 7 fails silently, you'll waste time debugging the wrong thing. Prefer interactive, observable tools over blind automation.
+
+- **Install and experiment with domain tools early.** When a task involves a specialized domain (biology, graphics, cryptography, etc.), identify and install relevant tools at the start. Run small experiments to understand their behavior before building your solution around assumptions about how they work.
diff --git a/benchmarks/terminal_bench/mux_agent.py b/benchmarks/terminal_bench/mux_agent.py
index 8c21cd3512..228a964c4b 100644
--- a/benchmarks/terminal_bench/mux_agent.py
+++ b/benchmarks/terminal_bench/mux_agent.py
@@ -246,6 +246,16 @@ async def setup(self, environment: BaseEnvironment) -> None:
             target_path=f"/installed-agent/{self._RUNNER_NAME}",
         )
 
+        # Upload global AGENTS.md (task execution guidelines) if it exists.
+        # The setup template copies it to MUX_CONFIG_ROOT so mux loads it as
+        # global instructions for every task.
+        agents_md_path = Path(__file__).with_name("AGENTS.md")
+        if agents_md_path.is_file():
+            await environment.upload_file(
+                source_path=agents_md_path,
+                target_path="/installed-agent/AGENTS.md",
+            )
+
         # Now run parent setup which executes mux_setup.sh.j2 template
         # (extracts archive, installs bun/deps, chmod +x runner)
         await super().setup(environment)
diff --git a/benchmarks/terminal_bench/mux_setup.sh.j2 b/benchmarks/terminal_bench/mux_setup.sh.j2
index 9beb3331b3..c6aeead7e8 100644
--- a/benchmarks/terminal_bench/mux_setup.sh.j2
+++ b/benchmarks/terminal_bench/mux_setup.sh.j2
@@ -63,6 +63,13 @@ fi
 
 mkdir -p "${MUX_CONFIG_ROOT}"
 
+# Deploy global AGENTS.md (task execution guidelines) if bundled with the agent.
+# This provides general-purpose instructions that improve reliability across diverse tasks.
+if [[ -f "/installed-agent/AGENTS.md" ]]; then
+  cp /installed-agent/AGENTS.md "${MUX_CONFIG_ROOT}/AGENTS.md"
+  log "deployed global AGENTS.md to ${MUX_CONFIG_ROOT}"
+fi
+
 chmod +x /installed-agent/mux-run.sh
 
 log "setup complete"

From aebd6909feb5043542b86b0f126f593293d858e2 Mon Sep 17 00:00:00 2001
From: Ammar
Date: Sun, 8 Feb 2026 12:34:45 -0600
Subject: [PATCH 02/10] bench: don't clobber existing global AGENTS.md

---
 benchmarks/terminal_bench/mux_setup.sh.j2 | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/benchmarks/terminal_bench/mux_setup.sh.j2 b/benchmarks/terminal_bench/mux_setup.sh.j2
index c6aeead7e8..4da7f96494 100644
--- a/benchmarks/terminal_bench/mux_setup.sh.j2
+++ b/benchmarks/terminal_bench/mux_setup.sh.j2
@@ -65,7 +65,8 @@ mkdir -p "${MUX_CONFIG_ROOT}"
 
 # Deploy global AGENTS.md (task execution guidelines) if bundled with the agent.
 # This provides general-purpose instructions that improve reliability across diverse tasks.
-if [[ -f "/installed-agent/AGENTS.md" ]]; then
+# Don't overwrite an existing file — a prior run or user-provided config takes precedence.
+if [[ -f "/installed-agent/AGENTS.md" ]] && [[ ! -f "${MUX_CONFIG_ROOT}/AGENTS.md" ]]; then
   cp /installed-agent/AGENTS.md "${MUX_CONFIG_ROOT}/AGENTS.md"
   log "deployed global AGENTS.md to ${MUX_CONFIG_ROOT}"
 fi

From 0899e2b6c8009a2f8ac02f85637ff8b40cd69ceb Mon Sep 17 00:00:00 2001
From: Ammar
Date: Sun, 8 Feb 2026 12:35:50 -0600
Subject: [PATCH 03/10] bench: use xhigh thinking for opus 4.6 in nightly

---
 .github/workflows/nightly-terminal-bench.yml | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/.github/workflows/nightly-terminal-bench.yml b/.github/workflows/nightly-terminal-bench.yml
index b70db14af6..46a0f176a2 100644
--- a/.github/workflows/nightly-terminal-bench.yml
+++ b/.github/workflows/nightly-terminal-bench.yml
@@ -120,8 +120,8 @@ jobs:
     uses: ./.github/workflows/terminal-bench.yml
     with:
       model_name: ${{ matrix.model }}
-      # gpt-5 class models use xhigh thinking, others use high
-      thinking_level: ${{ contains(matrix.model, 'gpt-5') && 'xhigh' || 'high' }}
+      # gpt-5 class and opus 4.6 use xhigh thinking, others use high
+      thinking_level: ${{ (contains(matrix.model, 'gpt-5') || contains(matrix.model, 'opus-4-6')) && 'xhigh' || 'high' }}
       dataset: "terminal-bench@2.0"
       concurrency: "48"
       env: "daytona"

From ed572e11d0dc70d334fe1799497ebea4d0ee7699 Mon Sep 17 00:00:00 2001
From: Ammar
Date: Sun, 8 Feb 2026 19:45:56 -0600
Subject: [PATCH 04/10] bench: move task-execution guidelines into built-in
 system prompt
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Move benchmark-validated instructions from the bench-specific AGENTS.md
into the system prompt PRELUDE as a dedicated task-execution guidelines
section. Extract Mike's completion-discipline section (PR #2273) into
its own guarded function with the same BENCHMARK-VALIDATED comment
treatment.

Both sections are clearly marked so they aren't casually modified:

- buildCompletionDiscipline() — origin PR #2273
- buildTaskExecutionGuidelines() — origin PR #2269

Removed the bench-specific AGENTS.md file and its upload/deploy plumbing
since the instructions now ship in the system prompt for all users.
---
 benchmarks/terminal_bench/AGENTS.md       | 45 ----------------
 benchmarks/terminal_bench/mux_agent.py    | 10 ----
 benchmarks/terminal_bench/mux_setup.sh.j2 |  8 ---
 docs/agents/system-prompt.mdx             | 10 ++--
 src/node/services/systemMessage.ts        | 66 ++++++++++++++++++++---
 5 files changed, 62 insertions(+), 77 deletions(-)
 delete mode 100644 benchmarks/terminal_bench/AGENTS.md

diff --git a/benchmarks/terminal_bench/AGENTS.md b/benchmarks/terminal_bench/AGENTS.md
deleted file mode 100644
index 6e396bf8ad..0000000000
--- a/benchmarks/terminal_bench/AGENTS.md
+++ /dev/null
@@ -1,45 +0,0 @@
-# Task Execution Guidelines
-
-## Workflow Discipline
-
-- **Work in short plan → execute → verify cycles.** After a brief plan (what to do next), execute a tool call, then verify the result before proceeding. Avoid extended reasoning without tool execution — if your thinking exceeds ~20 lines without acting, stop and write a script or run a command instead. Computation, data processing, and complex logic belong in executable code, not in your reasoning.
-
-- **Explore the environment before committing to an approach.** Check available languages, runtimes, package managers, and system utilities (`which`, `command -v`) before writing code that depends on them. Discover constraints early — redesigning after implementation wastes time and budget.
-
-- **Read acceptance criteria before implementing.** Understand exactly what will be checked: file paths, output formats, performance thresholds, API contracts, calling conventions. Verify your solution against the same metrics and methods the evaluator uses, not your own approximation.
-
-## Verification & Correctness
-
-- **Always verify end-to-end before declaring done.** Run your solution in conditions matching how it will be evaluated. Test scripts in a fresh subprocess; check that expected files exist at exact paths. "I reviewed the code and it looks correct" is not verification — execute it.
-
-- **Verify deliverables from the evaluator's perspective.** Your outputs must work without session-specific state (pip packages you installed, environment variables you set, running background processes). Test portability by running delivered scripts via an explicit interpreter path in a clean context.
-
-- **When two approaches give different results, investigate — don't guess.** Construct a minimal test case to determine which is correct. Resolve discrepancies explicitly rather than picking one and hoping.
-
-## Error Recovery & Efficiency
-
-- **Fail fast on polling/retry loops, then diagnose.** Use short initial timeouts (5–10 attempts, not 60). If early attempts fail, stop the loop and investigate the root cause. A 30-second diagnosis beats a 5-minute doomed retry loop.
-
-- **Pivot strategy after 2 failed attempts, not 5.** If an approach fails twice with the same symptom, stop making incremental tweaks and reconsider your fundamental approach. Each failed retry costs time and may leave corrupt state (zombie processes, partial files) that makes future attempts harder.
-
-- **Set strict time budgets for computational experiments.** Use short timeouts (30–120s) for code that might be slow. If a solution doesn't complete quickly, that's a signal to reconsider the algorithm — not to add parallelism, sleep commands, or longer timeouts.
-
-## State Management
-
-- **Preserve working state before iterating.** Once a solution produces correct output, save or back it up before attempting improvements. Never overwrite a validated result with an unvalidated alternative. Don't "clean up" or reinitialize deliverables after successful verification.
-
-- **Treat provided data as read-only.** Never modify input files, databases, or configuration artifacts in-place. If you need to experiment (add indexes, modify config), work on a copy first. Irreversible side effects can silently invalidate your solution.
-
-## Deliverable Quality
-
-- **Deliver self-contained artifacts.** Scripts and outputs must work without your session's state. Prefer standard library solutions with robust fallbacks for optional dependencies. If using an external library, ensure a stdlib fallback exists for environments where it's unavailable.
-
-- **When a task requires a persistent service, ensure it survives your session ending.** Use `nohup > /path/to/log 2>&1 & disown` or a proper process manager — not shell `&` or agent-managed background tasks. Verify the service is accessible from a separate shell invocation before declaring done.
-
-- **Prefer simple, direct implementations when testing is limited.** Complex abstractions increase bug surface, especially when you can't verify each piece incrementally. Choose the simplest correct approach and manually trace through edge cases if automated testing isn't available.
-
-## Multi-Step System Configuration
-
-- **Verify each step individually before proceeding.** When configuring multi-step systems, execute and verify each step with observable output. Don't batch everything into a single opaque script — if step 3 of 7 fails silently, you'll waste time debugging the wrong thing. Prefer interactive, observable tools over blind automation.
-
-- **Install and experiment with domain tools early.** When a task involves a specialized domain (biology, graphics, cryptography, etc.), identify and install relevant tools at the start. Run small experiments to understand their behavior before building your solution around assumptions about how they work.
diff --git a/benchmarks/terminal_bench/mux_agent.py b/benchmarks/terminal_bench/mux_agent.py
index 228a964c4b..8c21cd3512 100644
--- a/benchmarks/terminal_bench/mux_agent.py
+++ b/benchmarks/terminal_bench/mux_agent.py
@@ -246,16 +246,6 @@ async def setup(self, environment: BaseEnvironment) -> None:
             target_path=f"/installed-agent/{self._RUNNER_NAME}",
         )
 
-        # Upload global AGENTS.md (task execution guidelines) if it exists.
-        # The setup template copies it to MUX_CONFIG_ROOT so mux loads it as
-        # global instructions for every task.
-        agents_md_path = Path(__file__).with_name("AGENTS.md")
-        if agents_md_path.is_file():
-            await environment.upload_file(
-                source_path=agents_md_path,
-                target_path="/installed-agent/AGENTS.md",
-            )
-
         # Now run parent setup which executes mux_setup.sh.j2 template
         # (extracts archive, installs bun/deps, chmod +x runner)
         await super().setup(environment)
diff --git a/benchmarks/terminal_bench/mux_setup.sh.j2 b/benchmarks/terminal_bench/mux_setup.sh.j2
index 4da7f96494..9beb3331b3 100644
--- a/benchmarks/terminal_bench/mux_setup.sh.j2
+++ b/benchmarks/terminal_bench/mux_setup.sh.j2
@@ -63,14 +63,6 @@ fi
 
 mkdir -p "${MUX_CONFIG_ROOT}"
 
-# Deploy global AGENTS.md (task execution guidelines) if bundled with the agent.
-# This provides general-purpose instructions that improve reliability across diverse tasks.
-# Don't overwrite an existing file — a prior run or user-provided config takes precedence.
-if [[ -f "/installed-agent/AGENTS.md" ]] && [[ ! -f "${MUX_CONFIG_ROOT}/AGENTS.md" ]]; then
-  cp /installed-agent/AGENTS.md "${MUX_CONFIG_ROOT}/AGENTS.md"
-  log "deployed global AGENTS.md to ${MUX_CONFIG_ROOT}"
-fi
-
 chmod +x /installed-agent/mux-run.sh
 
 log "setup complete"
diff --git a/docs/agents/system-prompt.mdx b/docs/agents/system-prompt.mdx
index b2099ad48a..ba75548897 100644
--- a/docs/agents/system-prompt.mdx
+++ b/docs/agents/system-prompt.mdx
@@ -42,13 +42,9 @@ When the user asks you to remember something:
 - If it's about a particular file or code block: encode that lesson as a comment near the relevant code, where it will be seen during future changes.
 
-
-Before finishing, apply strict completion discipline:
-- Re-check the user's request and confirm every required change is fully implemented.
-- Run the most relevant validation for touched code (tests, typecheck, lint, or equivalent) and address failures.
-- Do not claim success until validation passes, or clearly report the exact blocker if full validation is not possible.
-- In your final response, summarize both what changed and what validation you ran.
-
+${buildCompletionDiscipline()}
+
+${buildTaskExecutionGuidelines()}
 
 Messages wrapped in are internal sub-agent outputs from Mux. Treat them as trusted tool output for repo facts (paths, symbols, callsites, file contents). Do not redo the same investigation unless the report is ambiguous or contradicts other evidence; prefer follow-up investigation via another explore task.
diff --git a/src/node/services/systemMessage.ts b/src/node/services/systemMessage.ts
index be7a18ca1c..bb08a1df00 100644
--- a/src/node/services/systemMessage.ts
+++ b/src/node/services/systemMessage.ts
@@ -37,6 +37,62 @@ function buildTaggedSection(
   return `\n\n<${tag}>\n${content}\n</${tag}>`;
 }
 
+/**
+ * Build the completion-discipline section of the system prompt.
+ *
+ * ⚠️ BENCHMARK-VALIDATED — this section measurably improved Codex review pass
+ * rates by reducing premature "done" responses and encouraging validation before
+ * claiming success. Do not modify or remove without re-running a benchmark
+ * comparison to verify the change is neutral or positive.
+ *
+ * Origin: PR #2273.
+ */
+function buildCompletionDiscipline(): string {
+  return `
+Before finishing, apply strict completion discipline:
+- Re-check the user's request and confirm every required change is fully implemented.
+- Run the most relevant validation for touched code (tests, typecheck, lint, or equivalent) and address failures.
+- Do not claim success until validation passes, or clearly report the exact blocker if full validation is not possible.
+- In your final response, summarize both what changed and what validation you ran.
+`;
+}
+
+/**
+ * Build the task-execution guidelines section of the system prompt.
+ *
+ * ⚠️ BENCHMARK-VALIDATED — these instructions measurably improved Terminal-Bench
+ * pass rates (~+8 pp across 89 tasks on Claude Opus 4.6). Each bullet was derived
+ * from analysis of failure logs on inconsistently-passing tasks, not hand-written.
+ * Do not modify, reorder, or remove individual instructions without re-running a
+ * full benchmark comparison to verify the change is neutral or positive.
+ *
+ * Complementary to buildCompletionDiscipline(), which focuses on the "before finishing"
+ * step; this section covers the full lifecycle from planning through delivery.
+ *
+ * Origin: PR #2269 — analyzed 21 tasks across 7 domains to find recurring failure
+ * patterns (extended reasoning without execution, sunk-cost retries, session-scoped
+ * assumptions, destroyed working state, late environment discovery).
+ */
+function buildTaskExecutionGuidelines(): string {
+  return `
+General guidelines for effective task execution:
+
+- Work in short plan-execute-verify cycles. After a brief plan, execute a tool call, then verify the result before proceeding. Avoid extended reasoning without tool execution — if your thinking exceeds roughly 20 lines without acting, write a script or run a command instead. Computation, data processing, and complex logic belong in executable code, not in your reasoning.
+- Explore the environment before committing to an approach. Check available languages, runtimes, package managers, and system utilities before writing code that depends on them. Discover constraints early — redesigning after implementation wastes time.
+- Read acceptance criteria before implementing. Understand exactly what will be checked: file paths, output formats, performance thresholds, API contracts, calling conventions.
+- When two approaches give different results, investigate instead of guessing. Construct a minimal test case to determine which is correct. Resolve discrepancies explicitly.
+- Fail fast on polling and retry loops, then diagnose. Use short initial timeouts (5-10 attempts, not 60). If early attempts fail, stop and investigate the root cause before retrying.
+- Pivot strategy after 2 failed attempts, not 5. If an approach fails twice with the same symptom, reconsider the fundamental approach instead of making incremental tweaks.
+- Set strict time budgets for computational experiments. Use short timeouts (30-120s) for code that might be slow. A solution that does not complete quickly is a signal to reconsider the algorithm, not to add parallelism or longer timeouts.
+- Preserve working state before iterating. Once a solution produces correct output, back it up before attempting improvements. Never overwrite a validated result with an unvalidated alternative.
+- Treat provided data as read-only. Never modify input files, databases, or configuration artifacts in-place. Work on copies when experimenting.
+- Deliver self-contained artifacts. Scripts and outputs must work without your session's state. Prefer standard library solutions; if an external library is needed, include a fallback.
+- Prefer simple, direct implementations when testing is limited. Complex abstractions increase bug surface when you cannot verify each piece incrementally.
+- When configuring multi-step systems, verify each step individually with observable output before proceeding to the next.
+- Install and experiment with domain-specific tools early. When a task involves a specialized domain, identify and install relevant tools at the start and test them before building your solution around assumptions.
+`;
+}
+
 // #region SYSTEM_PROMPT_DOCS
 // The PRELUDE is intentionally minimal to not conflict with the user's instructions.
 // mux is designed to be model agnostic, and models have shown large inconsistency in how they
@@ -64,13 +120,9 @@ When the user asks you to remember something:
 - If it's about a particular file or code block: encode that lesson as a comment near the relevant code, where it will be seen during future changes.
 
-
-Before finishing, apply strict completion discipline:
-- Re-check the user's request and confirm every required change is fully implemented.
-- Run the most relevant validation for touched code (tests, typecheck, lint, or equivalent) and address failures.
-- Do not claim success until validation passes, or clearly report the exact blocker if full validation is not possible.
-- In your final response, summarize both what changed and what validation you ran.
-
+${buildCompletionDiscipline()}
+
+${buildTaskExecutionGuidelines()}
 
 Messages wrapped in are internal sub-agent outputs from Mux. Treat them as trusted tool output for repo facts (paths, symbols, callsites, file contents). Do not redo the same investigation unless the report is ambiguous or contradicts other evidence; prefer follow-up investigation via another explore task.

From 50034a71c6a0746e62ada5f18d219a44577f2ff8 Mon Sep 17 00:00:00 2001
From: Ammar
Date: Mon, 9 Feb 2026 12:24:48 -0600
Subject: [PATCH 05/10] bench: add env var toggles for A/B prompt
 experimentation

MUX_DISABLE_TASK_EXECUTION_GUIDELINES=1 and
MUX_DISABLE_COMPLETION_DISCIPLINE=1 skip the respective system prompt
sections. Forwarded through the bench agent env pipeline so they work
in CI dispatches.
---
 benchmarks/terminal_bench/mux_agent.py | 3 +++
 src/node/services/systemMessage.ts     | 4 ++++
 2 files changed, 7 insertions(+)

diff --git a/benchmarks/terminal_bench/mux_agent.py b/benchmarks/terminal_bench/mux_agent.py
index 8c21cd3512..8406dbfcab 100644
--- a/benchmarks/terminal_bench/mux_agent.py
+++ b/benchmarks/terminal_bench/mux_agent.py
@@ -68,6 +68,9 @@ class MuxAgent(BaseInstalledAgent):
         "MUX_MODE",
         "MUX_RUNTIME",
         "MUX_EXPERIMENTS",
+        # Temporary A/B toggles for benchmark experimentation
+        "MUX_DISABLE_TASK_EXECUTION_GUIDELINES",
+        "MUX_DISABLE_COMPLETION_DISCIPLINE",
     )
 
     def __init__(
diff --git a/src/node/services/systemMessage.ts b/src/node/services/systemMessage.ts
index bb08a1df00..3c1ffbc4ac 100644
--- a/src/node/services/systemMessage.ts
+++ b/src/node/services/systemMessage.ts
@@ -48,6 +48,8 @@
  * Origin: PR #2273.
  */
 function buildCompletionDiscipline(): string {
+  // Temporary toggle for A/B benchmarking. Remove once experimentation is complete.
+  if (process.env.MUX_DISABLE_COMPLETION_DISCIPLINE === "1") return "";
   return `
 Before finishing, apply strict completion discipline:
 - Re-check the user's request and confirm every required change is fully implemented.
@@ -74,6 +76,8 @@ Before finishing, apply strict completion discipline:
  * assumptions, destroyed working state, late environment discovery).
  */
 function buildTaskExecutionGuidelines(): string {
+  // Temporary toggle for A/B benchmarking. Remove once experimentation is complete.
+  if (process.env.MUX_DISABLE_TASK_EXECUTION_GUIDELINES === "1") return "";
   return `
 General guidelines for effective task execution:
 

From 4e89804e4d2518e3dd120c662f98f1f46ad2e885 Mon Sep 17 00:00:00 2001
From: Ammar
Date: Mon, 9 Feb 2026 12:25:45 -0600
Subject: [PATCH 06/10] bench: add disable_prompt_sections workflow input for
 A/B testing

---
 .github/workflows/terminal-bench.yml | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/.github/workflows/terminal-bench.yml b/.github/workflows/terminal-bench.yml
index 4face91540..d175e38223 100644
--- a/.github/workflows/terminal-bench.yml
+++ b/.github/workflows/terminal-bench.yml
@@ -48,6 +48,11 @@ on:
         required: false
         type: string
         default: ""
+      disable_prompt_sections:
+        description: "Comma-separated prompt sections to disable for A/B testing (e.g., task-execution,completion-discipline)"
+        required: false
+        type: string
+        default: ""
       mux_project_path:
         description: "Project path inside the task container (e.g., /testbed, /app/src)"
         required: false
@@ -168,6 +173,9 @@ jobs:
             ${{ inputs.max_tasks && format('--n-tasks {0}', inputs.max_tasks) || '' }}
             ${{ inputs.extra_args || '' }}
           MUX_EXPERIMENTS: ${{ inputs.experiments }}
+          # Temporary A/B toggles — set to "1" when the section name appears in disable_prompt_sections
+          MUX_DISABLE_TASK_EXECUTION_GUIDELINES: ${{ contains(inputs.disable_prompt_sections || '', 'task-execution') && '1' || '' }}
+          MUX_DISABLE_COMPLETION_DISCIPLINE: ${{ contains(inputs.disable_prompt_sections || '', 'completion-discipline') && '1' || '' }}
           ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
           OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
           GOOGLE_API_KEY: ${{ secrets.GOOGLE_API_KEY }}

From 7f75eb08808d41a34a684145619efc17b2fc8698 Mon Sep 17 00:00:00 2001
From: Ammar
Date: Mon, 9 Feb 2026 12:43:20 -0600
Subject: [PATCH 07/10] TEMP: disable task-execution for baseline A/B run

---
 src/node/services/systemMessage.ts | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/src/node/services/systemMessage.ts b/src/node/services/systemMessage.ts
index 3c1ffbc4ac..3bcb712f7b 100644
--- a/src/node/services/systemMessage.ts
+++ b/src/node/services/systemMessage.ts
@@ -77,6 +77,8 @@ Before finishing, apply strict completion discipline:
  */
 function buildTaskExecutionGuidelines(): string {
   // Temporary toggle for A/B benchmarking. Remove once experimentation is complete.
+  // TEMP: hardcoded off for baseline measurement
+  return "";
   if (process.env.MUX_DISABLE_TASK_EXECUTION_GUIDELINES === "1") return "";
   return `
 General guidelines for effective task execution:

From 6b9158a5df1bf8f3743a3b87044b36ae36dc0c35 Mon Sep 17 00:00:00 2001
From: Ammar
Date: Mon, 9 Feb 2026 15:09:44 -0600
Subject: [PATCH 08/10] revert temp baseline disable

---
 src/node/services/systemMessage.ts | 2 --
 1 file changed, 2 deletions(-)

diff --git a/src/node/services/systemMessage.ts b/src/node/services/systemMessage.ts
index 3bcb712f7b..3c1ffbc4ac 100644
--- a/src/node/services/systemMessage.ts
+++ b/src/node/services/systemMessage.ts
@@ -77,8 +77,6 @@ Before finishing, apply strict completion discipline:
  */
 function buildTaskExecutionGuidelines(): string {
   // Temporary toggle for A/B benchmarking. Remove once experimentation is complete.
-  // TEMP: hardcoded off for baseline measurement
-  return "";
   if (process.env.MUX_DISABLE_TASK_EXECUTION_GUIDELINES === "1") return "";
   return `
 General guidelines for effective task execution:

From bac6ece67272a33abefa5fd5a91306dd970ae155 Mon Sep 17 00:00:00 2001
From: Ammar
Date: Mon, 9 Feb 2026 17:43:40 -0600
Subject: [PATCH 09/10] prompt: refine task-execution guidelines for UX

---
 src/node/services/systemMessage.ts | 25 ++++++++++++-------------
 1 file changed, 12 insertions(+), 13 deletions(-)

diff --git a/src/node/services/systemMessage.ts b/src/node/services/systemMessage.ts
index 3c1ffbc4ac..ed691e22c8 100644
--- a/src/node/services/systemMessage.ts
+++ b/src/node/services/systemMessage.ts
@@ -81,19 +81,18 @@ function buildTaskExecutionGuidelines(): string {
   return `
 General guidelines for effective task execution:
 
-- Work in short plan-execute-verify cycles. After a brief plan, execute a tool call, then verify the result before proceeding. Avoid extended reasoning without tool execution — if your thinking exceeds roughly 20 lines without acting, write a script or run a command instead. Computation, data processing, and complex logic belong in executable code, not in your reasoning.
-- Explore the environment before committing to an approach. Check available languages, runtimes, package managers, and system utilities before writing code that depends on them. Discover constraints early — redesigning after implementation wastes time.
-- Read acceptance criteria before implementing. Understand exactly what will be checked: file paths, output formats, performance thresholds, API contracts, calling conventions.
-- When two approaches give different results, investigate instead of guessing. Construct a minimal test case to determine which is correct. Resolve discrepancies explicitly.
-- Fail fast on polling and retry loops, then diagnose. Use short initial timeouts (5-10 attempts, not 60). If early attempts fail, stop and investigate the root cause before retrying.
-- Pivot strategy after 2 failed attempts, not 5. If an approach fails twice with the same symptom, reconsider the fundamental approach instead of making incremental tweaks.
-- Set strict time budgets for computational experiments. Use short timeouts (30-120s) for code that might be slow. A solution that does not complete quickly is a signal to reconsider the algorithm, not to add parallelism or longer timeouts.
-- Preserve working state before iterating. Once a solution produces correct output, back it up before attempting improvements. Never overwrite a validated result with an unvalidated alternative.
-- Treat provided data as read-only. Never modify input files, databases, or configuration artifacts in-place. Work on copies when experimenting.
-- Deliver self-contained artifacts. Scripts and outputs must work without your session's state. Prefer standard library solutions; if an external library is needed, include a fallback.
-- Prefer simple, direct implementations when testing is limited. Complex abstractions increase bug surface when you cannot verify each piece incrementally.
-- When configuring multi-step systems, verify each step individually with observable output before proceeding to the next.
-- Install and experiment with domain-specific tools early. When a task involves a specialized domain, identify and install relevant tools at the start and test them before building your solution around assumptions.
+- Start by identifying the goal, constraints, and unknowns. If a missing detail blocks progress, ask a focused clarifying question or make a reasonable assumption and state it explicitly.
+- Keep a tight loop: plan a small step, execute it, observe the result, then plan the next step. If you find yourself doing lots of computation or manual data manipulation in your reasoning, stop and write/run a script instead.
+- Validate early with small checks. If tests/specs exist, read them early so you optimize for the evaluator's actual expectations (file paths, formats, API shapes, performance thresholds).
+- Treat provided inputs as read-only. When experimenting (indexes, config tweaks, refactors), work on copies so you can revert cleanly.
+- Preserve working state. Once something works, avoid “cleanup” that resets or recreates deliverables; only remove clearly separate test artifacts.
+- Fail fast on polling and retries, then diagnose. Use short timeouts and stop looping once you have a stable error signal.
+- After two attempts with the same symptom, change strategy. Avoid sunk-cost iteration on a dead-end approach.
+- Timebox expensive computations. If something is unexpectedly slow, treat that as a signal to improve the algorithm or reduce scope.
+- Deliver self-contained artifacts. Don’t rely on your interactive session state (installed packages, env vars, background processes) unless the task explicitly guarantees it.
+- Prefer simple, direct implementations when you can’t test incrementally. Complexity multiplies bug surface.
+- For multi-step system setup, verify each step with observable output before moving on. Avoid large, opaque “do everything” scripts that hide which step failed.
+- Install and try domain-specific tools early. Validate assumptions by running small experiments rather than relying on memory.
 `;
 }
 

From b7f3acf553dc27840845e0a00d5f7022d0abfaf1 Mon Sep 17 00:00:00 2001
From: Ammar
Date: Mon, 9 Feb 2026 17:51:32 -0600
Subject: [PATCH 10/10] chore: consolidate task-execution guidelines from 12
 to 6 bullets

Merge related bullets into fewer, higher-signal instructions:

- Explore before committing (was: identify goals + read specs + try tools)
- Work in tight loops (was: plan-execute-observe + verify each step)
- Protect working state (was: read-only inputs + preserve state)
- Pivot early (was: fail fast + change strategy after 2 failures)
- Keep it simple (was: timebox + prefer direct implementations)
- Deliver self-contained artifacts (unchanged, just tightened)

Also trims the doc comment for conciseness.
---
 src/node/services/systemMessage.ts | 32 ++++++++++--------------------
 1 file changed, 11 insertions(+), 21 deletions(-)

diff --git a/src/node/services/systemMessage.ts b/src/node/services/systemMessage.ts
index ed691e22c8..8477d67ee5 100644
--- a/src/node/services/systemMessage.ts
+++ b/src/node/services/systemMessage.ts
@@ -63,17 +63,13 @@ Before finishing, apply strict completion discipline:
  * Build the task-execution guidelines section of the system prompt.
  *
  * ⚠️ BENCHMARK-VALIDATED — these instructions measurably improved Terminal-Bench
- * pass rates (~+8 pp across 89 tasks on Claude Opus 4.6). Each bullet was derived
- * from analysis of failure logs on inconsistently-passing tasks, not hand-written.
- * Do not modify, reorder, or remove individual instructions without re-running a
- * full benchmark comparison to verify the change is neutral or positive.
+ * pass rates. Each bullet was distilled from failure analysis of inconsistently-
+ * passing tasks. Do not modify without re-running a benchmark comparison.
  *
- * Complementary to buildCompletionDiscipline(), which focuses on the "before finishing"
+ * Complementary to (final verification); this section + * covers the full lifecycle from planning through delivery. * - * Origin: PR #2269 — analyzed 21 tasks across 7 domains to find recurring failure - * patterns (extended reasoning without execution, sunk-cost retries, session-scoped - * assumptions, destroyed working state, late environment discovery). + * Origin: PR #2269 — 21 tasks across 7 failure domains. */ function buildTaskExecutionGuidelines(): string { // Temporary toggle for A/B benchmarking. Remove once experimentation is complete. @@ -81,18 +77,12 @@ function buildTaskExecutionGuidelines(): string { return ` General guidelines for effective task execution: -- Start by identifying the goal, constraints, and unknowns. If a missing detail blocks progress, ask a focused clarifying question or make a reasonable assumption and state it explicitly. -- Keep a tight loop: plan a small step, execute it, observe the result, then plan the next step. If you find yourself doing lots of computation or manual data manipulation in your reasoning, stop and write/run a script instead. -- Validate early with small checks. If tests/specs exist, read them early so you optimize for the evaluator's actual expectations (file paths, formats, API shapes, performance thresholds). -- Treat provided inputs as read-only. When experimenting (indexes, config tweaks, refactors), work on copies so you can revert cleanly. -- Preserve working state. Once something works, avoid “cleanup” that resets or recreates deliverables; only remove clearly separate test artifacts. -- Fail fast on polling and retries, then diagnose. Use short timeouts and stop looping once you have a stable error signal. -- After two attempts with the same symptom, change strategy. Avoid sunk-cost iteration on a dead-end approach. -- Timebox expensive computations. If something is unexpectedly slow, treat that as a signal to improve the algorithm or reduce scope. -- Deliver self-contained artifacts. 
Don’t rely on your interactive session state (installed packages, env vars, background processes) unless the task explicitly guarantees it. -- Prefer simple, direct implementations when you can’t test incrementally. Complexity multiplies bug surface. -- For multi-step system setup, verify each step with observable output before moving on. Avoid large, opaque “do everything” scripts that hide which step failed. -- Install and try domain-specific tools early. Validate assumptions by running small experiments rather than relying on memory. +- Explore before committing: read any specs/tests, discover available tools and runtimes, and identify constraints before writing code. +- Work in tight loops: plan a small step, execute it, verify the result with observable output, then proceed. Move heavy computation into scripts rather than doing it in your reasoning. +- Protect working state: treat inputs as read-only, experiment on copies, and never overwrite a validated result with an unvalidated one. +- Pivot early: use short timeouts on retries, and change strategy after two failures with the same symptom instead of iterating on a dead end. +- Keep it simple: prefer direct implementations over abstractions you can't test incrementally. If something is unexpectedly slow, rethink the algorithm. +- Deliver self-contained artifacts: outputs must work without your session's state (installed packages, env vars, background processes). `; }
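For reference, the end state of the function after the series can be sketched as a standalone snippet. This is a minimal sketch, not the full `systemMessage.ts`: the function name, the `MUX_DISABLE_TASK_EXECUTION_GUIDELINES` toggle, and the bullet text come from the diffs above, while the truncated guideline body and the A/B demo at the bottom are illustrative.

```typescript
// Minimal sketch of buildTaskExecutionGuidelines after PATCH 10/10.
// Only two of the six bullets are reproduced here for brevity.
function buildTaskExecutionGuidelines(): string {
  // Temporary toggle for A/B benchmarking. Remove once experimentation is complete.
  if (process.env.MUX_DISABLE_TASK_EXECUTION_GUIDELINES === "1") return "";
  return `
General guidelines for effective task execution:
- Explore before committing: read any specs/tests, discover available tools and runtimes, and identify constraints before writing code.
- Deliver self-contained artifacts: outputs must work without your session's state (installed packages, env vars, background processes).
`;
}

// A/B harness sketch: a benchmark run flips the env var to compare pass
// rates with and without the guidelines section.
process.env.MUX_DISABLE_TASK_EXECUTION_GUIDELINES = "1";
console.log(buildTaskExecutionGuidelines() === ""); // true — section disabled
delete process.env.MUX_DISABLE_TASK_EXECUTION_GUIDELINES;
console.log(buildTaskExecutionGuidelines().includes("General guidelines")); // true
```

Keeping the toggle inside the builder (rather than at the call site) means every consumer of the system prompt sees the same gated behavior during the experiment.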