Add CLI Agent Benchmark Tasks documentation #182
Open
oddessentials wants to merge 1 commit into main from
Conversation
Define a structured benchmark suite for evaluating AI-backed CLI tools against this polyglot codebase. Covers cross-service schema evolution, bug detection, dependency upgrades, test authoring, self-review, feature planning, linter compliance, health check standardization, and architecture comprehension. Each task has a deterministic grading rubric with automatable pass/fail criteria (106 total points across 10 tasks). https://claude.ai/code/session_018FhKabQMCUFrhJq8TGS8dm
Summary
This PR introduces a comprehensive benchmark suite for evaluating AI-backed CLI tools against the Distributed Task Observatory polyglot codebase. The document defines 10 deterministic tasks spanning multiple difficulty levels and capability areas, each with detailed grading rubrics and automated verification criteria.
Key Changes
docs/CLI_BENCHMARK_TASKS.md (408 lines)
Notable Implementation Details
Grading Automation: Most rubric items (>80%) are fully automatable via shell commands, Python assertions, or exit-code checks. Only 4 items require human review (self-review quality, plan accuracy, error-free explanations).
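As a concrete illustration (not taken from the document itself), a minimal Python harness for one automatable rubric item might look like the sketch below; the commands, service directories, and point values are hypothetical placeholders.

```python
# Minimal sketch of an automatable rubric item: run a verification command
# in the task checkout and award points only on a zero exit code.
# The commands, directories, and point values below are hypothetical.
import subprocess

def check_rubric_item(command: list[str], points: int, cwd: str = ".") -> int:
    """Award `points` if `command` exits 0 in `cwd`, else 0."""
    result = subprocess.run(command, cwd=cwd, capture_output=True, text=True)
    return points if result.returncode == 0 else 0

score = 0
score += check_rubric_item(["npm", "test"], points=5, cwd="gateway")        # TypeScript service (assumed path)
score += check_rubric_item(["go", "vet", "./..."], points=3, cwd="metrics") # Go service (assumed path)
print(f"automated score: {score}")
```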
Real Bug Catalog: Task 2 includes a table of 5 actual bugs seeded in the codebase for agents to discover and fix, with specific line numbers and verification methods.
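To make the verification method concrete, one seeded bug could be graded with a regression test along these lines; the module path, function, and expected behavior are invented for illustration and do not correspond to an actual catalog entry.

```python
# Hypothetical regression test for one seeded bug: it fails against the
# buggy code and passes once an agent applies the intended fix. The real
# catalog pins exact files and line numbers; everything here is invented.
def test_backoff_is_capped():
    from processor.retry import compute_backoff  # assumed module path

    # The seeded bug lets the exponential backoff grow without bound;
    # the fix clamps it at the configured ceiling.
    assert compute_backoff(attempt=20) <= 30.0
```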
Multi-Language Coverage: Tasks exercise TypeScript (Gateway), Python (Processor), Go (Metrics Engine, Read Model), and Rust (Web PTY Server, TUI) services.
Progressive Difficulty: Tasks range from Easy (linting, architecture explanation) to Hard (cross-service schema evolution, end-to-end feature implementation), allowing benchmarking of agents at different capability levels.
Deterministic Environment: All tasks run in a clean checkout with specified tool versions (Node 20, Python 3.11, Go 1.21, Rust 1.83) and wall-clock time limits (10–30 minutes).
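A pre-flight script along these lines could assert the pinned toolchain before each run; the version pins match the ones above, while the command set and regexes are an assumption about the harness.

```python
# Sketch of a pre-flight toolchain check for the deterministic environment.
# Version pins (Node 20, Python 3.11, Go 1.21, Rust 1.83) come from this PR;
# the exact commands and patterns are assumptions about the harness.
import re
import subprocess

PINS = [
    (["node", "--version"], r"^v20\."),
    (["python3", "--version"], r"\b3\.11\."),
    (["go", "version"], r"\bgo1\.21"),
    (["rustc", "--version"], r"\b1\.83"),
]

for cmd, pattern in PINS:
    out = subprocess.run(cmd, capture_output=True, text=True).stdout.strip()
    assert re.search(pattern, out), f"{cmd[0]!r} violates pin {pattern!r}: {out}"
print("toolchain matches the benchmark pins")
```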
Purpose
This benchmark enables objective, reproducible evaluation of AI CLI agents on realistic polyglot codebase tasks, with clear success criteria and automated grading where possible.
https://claude.ai/code/session_018FhKabQMCUFrhJq8TGS8dm