
Add CLI Agent Benchmark Tasks documentation#182

Open
oddessentials wants to merge 1 commit into main from claude/plan-cli-benchmark-tasks-bDj8m

Conversation

@oddessentials
Owner

Summary

This PR introduces a comprehensive benchmark suite for evaluating AI-backed CLI tools against the Distributed Task Observatory polyglot codebase. The document defines 10 deterministic tasks spanning multiple difficulty levels and capability areas, each with detailed grading rubrics and automated verification criteria.

Key Changes

  • New file: docs/CLI_BENCHMARK_TASKS.md (408 lines)
    • 10 benchmark tasks covering schema evolution, bug detection, dependency management, feature implementation, testing, code review, architecture planning, linting, standardization, and code comprehension
    • Each task includes:
      • Clear prompt given to the AI agent
      • Specific grading rubric with pass/fail criteria
      • Automated verification commands where applicable
      • Time limits and difficulty ratings
    • Scoring summary with 106 total points across all tasks
    • Capability matrix mapping tasks to agent capabilities
    • Automation notes with example grading script patterns

Notable Implementation Details

  • Grading Automation: Most rubric items (>80%) are fully automatable via shell commands, Python assertions, or exit-code checks. Only 4 items require manual human review (e.g., self-review quality, plan accuracy, error-free explanations).
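As a minimal sketch of the exit-code grading pattern the document describes (the helper name and commands below are illustrative, not taken from the actual grading scripts):

```python
"""Sketch of an exit-code grading check: award points iff a
verification command exits 0. Names are hypothetical."""
import subprocess
import sys


def grade_exit_code(cmd, points):
    """Run a verification command; award full points on exit code 0."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    return points if result.returncode == 0 else 0


score = 0
# Example rubric item: a trivial Python assertion run as a subprocess.
score += grade_exit_code([sys.executable, "-c", "assert 1 + 1 == 2"], 5)
print(f"score: {score}")
```

The same pattern extends to shell-based checks (`go test ./...`, `cargo test`, lint commands) by swapping in the relevant command list.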

  • Real Bug Catalog: Task 2 includes a table of 5 actual bugs seeded in the codebase for agents to discover and fix, with specific line numbers and verification methods.

  • Multi-Language Coverage: Tasks exercise TypeScript (Gateway), Python (Processor), Go (Metrics Engine, Read Model), and Rust (Web PTY Server, TUI) services.

  • Progressive Difficulty: Tasks range from Easy (linting, architecture explanation) to Hard (cross-service schema evolution, end-to-end feature implementation), allowing benchmarking of agents at different capability levels.

  • Deterministic Environment: All tasks run in a clean checkout with specified tool versions (Node 20, Python 3.11, Go 1.21, Rust 1.83) and wall-clock time limits (10–30 minutes).
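A pre-run environment check for the pinned toolchain could be sketched as follows (the command/version table reflects the versions stated above; the parsing is deliberately simplified and the function names are hypothetical):

```python
"""Illustrative pre-run check that the pinned tool versions are present.
Version substrings come from the benchmark's environment spec."""
import subprocess

# Expected version substrings per tool, as pinned in the benchmark spec.
EXPECTED = {
    "node": (["node", "--version"], "v20"),
    "python": (["python3", "--version"], "3.11"),
    "go": (["go", "version"], "go1.21"),
    "rust": (["rustc", "--version"], "1.83"),
}


def check_tool(name):
    """Return True if the tool is installed and reports the pinned version."""
    cmd, want = EXPECTED[name]
    try:
        result = subprocess.run(cmd, capture_output=True, text=True)
    except FileNotFoundError:
        return False  # tool not on PATH
    return want in (result.stdout + result.stderr)
```

Running this before each benchmark attempt would make environment drift an explicit failure rather than a silent source of nondeterminism.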

Purpose

This benchmark enables objective, reproducible evaluation of AI CLI agents on realistic polyglot codebase tasks, with clear success criteria and automated grading where possible.

https://claude.ai/code/session_018FhKabQMCUFrhJq8TGS8dm

Define a structured benchmark suite for evaluating AI-backed CLI tools
against this polyglot codebase. Covers cross-service schema evolution,
bug detection, dependency upgrades, test authoring, self-review, feature
planning, linter compliance, health check standardization, and
architecture comprehension. Each task has a deterministic grading rubric
with automatable pass/fail criteria (106 total points across 10 tasks).

