
feat(designer): Agent evaluations tab#8932

Draft
andrew-eldridge wants to merge 3 commits into main from aeldridge/agentEval

Conversation


andrew-eldridge (Contributor) commented Mar 17, 2026

Commit Type

  • [x] feature - New functionality
  • [ ] fix - Bug fix
  • [ ] refactor - Code restructuring without behavior change
  • [ ] perf - Performance improvement
  • [ ] docs - Documentation update
  • [ ] test - Test-related changes
  • [ ] chore - Maintenance/tooling

Risk Level

  • [ ] Low - Minor changes, limited scope
  • [x] Medium - Moderate changes, some user impact
  • [ ] High - Major changes, significant user/system impact

What & Why

Add agent evaluations functionality in a new designer tab. Allows users to evaluate A2A/agentic workflow runs using a predefined set of evaluators (tool call trajectory, semantic similarity, custom prompt). All evaluators either use reference runs as ground truth or a separate evaluator model as a judge.

Impact of Change

  • Users: Introduces new Agent Evaluations tab, new UI panels for runs/evaluators, and the ability to run evaluations from the designer. Only applies to A2A/agentic workflows
  • Developers: Adds new queries/hooks (react-query), a new Redux slice (evaluation), new UI components (EvaluateView and multiple panel components), new models and a StandardEvaluationService
  • System: API contract additions (new evaluator endpoints and run endpoints) and additional network calls for eval fetching/management and running evals

Test Plan

  • [ ] Unit tests added/updated
  • [ ] E2E tests added/updated
  • [ ] Manual testing completed
  • Tested in:

Contributors

@andrew-eldridge

Screenshots/Videos

andrew-eldridge added the risk:medium (Medium risk change with potential impact) label on Mar 17, 2026

github-actions bot commented Mar 17, 2026

🤖 AI PR Validation Report

PR Review Results

Thank you for your submission! Here's detailed feedback on your PR title and body compliance:

PR Title

  • Current: feat(designer): Agent evaluations tab
  • Issue: None — title is concise and follows conventional commit style. Consider adding a short scope or note if you want to call out a backend dependency (optional).
  • Recommendation: Keep as-is or optionally: feat(designer): Add Agent Evaluations tab and evaluation management UI

Commit Type

  • Properly selected (feature).
  • Note: Only one commit type is selected, which is correct for this change.

Risk Level

  • The PR declares Medium and a risk:medium label is present.
  • Assessment: The scale of this change (new UI surface, new Redux slice, new client-service APIs, many new network calls, and new models) reasonably maps to medium risk. However, because there are no unit/E2E tests included (see Test Plan below), I recommend either adding tests or temporarily increasing the risk level to High until testing is present.

What & Why

  • Current: "Add agent evaluations functionality in a new designer tab..." (present)
  • Issue: Acceptable and concise. Good description of feature & purpose.
  • Recommendation: No mandatory change; you may add one short line calling out backend contract changes (API endpoints) and whether a backend rollout is required.

Impact of Change

  • Impact section is present and explains Users/Developers/System impact.
  • Recommendation: Good. Consider calling out any migration steps or configuration toggles if the feature is behind a feature flag.

Test Plan

  • Assessment: Missing. The body has all test checkboxes unchecked and no explanation.

  • Issue: This PR adds significant functionality (new services, API calls, Redux state, and UI). Per repository guidance, if no unit tests or E2E tests are added there must be a clear manual testing justification and instructions. That is missing.

  • Recommendation (required to pass):

    • Add unit tests covering at minimum:
      • evaluationSlice reducer (actions, initial state, reducers like setSelectedEvaluator, setEvaluationResult, resetEvaluationState)
      • core queries (useEvaluators, useEvaluations) mocking EvaluationService to ensure caching/enable flags behave correctly
      • StandardEvaluationService: unit tests for constructed URIs and correct httpClient calls (mock httpClient)
      • Key UI components smoke tests: EvaluateView renders, EvaluatorManagementPanel lists evaluators, RunDatasetPanel lists runs and selects run/action
    • Add E2E tests or an integration test that exercises creating an evaluator, selecting a run, and running an evaluation (or provide a written, explicit manual test plan with steps, expected results, and environments if E2E not yet feasible).
    • Update the Test Plan section in the PR body to reflect the added tests (check boxes) and list the test environments.

    If for some reason tests cannot be added in this PR, explain why in the Test Plan and provide a follow-up ticket with timelines. Without tests or a clear justification, this PR should not be merged.
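To make the reducer recommendation above concrete, here is a minimal unit-test sketch in plain TypeScript. The state shape, action type strings, and payloads are assumptions for illustration only; the real evaluationSlice is presumably generated with Redux Toolkit's createSlice, so an actual test would import the generated actions and reducer rather than this stand-in.

```typescript
// Hypothetical shape of the evaluation slice state; field names are assumptions.
interface EvaluationState {
  selectedEvaluatorId: string | null;
  result: unknown | null;
}

const initialState: EvaluationState = { selectedEvaluatorId: null, result: null };

type EvaluationAction =
  | { type: "evaluation/setSelectedEvaluator"; payload: string }
  | { type: "evaluation/setEvaluationResult"; payload: unknown }
  | { type: "evaluation/resetEvaluationState" };

// Plain reducer standing in for the createSlice-generated one.
function evaluationReducer(
  state: EvaluationState = initialState,
  action: EvaluationAction
): EvaluationState {
  switch (action.type) {
    case "evaluation/setSelectedEvaluator":
      return { ...state, selectedEvaluatorId: action.payload };
    case "evaluation/setEvaluationResult":
      return { ...state, result: action.payload };
    case "evaluation/resetEvaluationState":
      return initialState;
    default:
      return state;
  }
}

// Test: selecting an evaluator then resetting restores the initial state.
let state = evaluationReducer(undefined, {
  type: "evaluation/setSelectedEvaluator",
  payload: "semantic-similarity",
});
if (state.selectedEvaluatorId !== "semantic-similarity") throw new Error("evaluator not selected");
state = evaluationReducer(state, { type: "evaluation/resetEvaluationState" });
if (state.selectedEvaluatorId !== null || state.result !== null) throw new Error("reset failed");
```

The same select-then-reset pattern extends naturally to setEvaluationResult and any other actions the slice exposes.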


⚠️ Contributors

  • Assessment: Contributors listed (@andrew-eldridge). Good.
  • Recommendation: If PMs/designers/reviewers contributed, please tag them here as a courtesy (optional).

⚠️ Screenshots/Videos

  • Assessment: None provided. This PR adds a major UI tab and many panels.
  • Recommendation: Please add screenshots or a short screencast of the new Evaluate tab UI (Runs, Evaluators, Right panel, Result panel). This helps reviewers catch visual regressions.

Summary Table

| Section | Status | Recommendation |
| --- | --- | --- |
| Title | | Keep as-is |
| Commit Type | | OK |
| Risk Level | ✅ (declared: Medium) | Consider raising to High or add tests to keep Medium |
| What & Why | | OK; optionally add backend rollout note |
| Impact of Change | | OK; mention feature flag or migration if any |
| Test Plan | | Add unit/E2E tests or a justified manual test plan |
| Contributors | | Add other contributors if applicable |
| Screenshots/Videos | ⚠️ | Add screenshots/screencast of UI changes |

Final notes & action items

  • This PR cannot pass the PR body/title compliance check because the Test Plan is empty for a significant functional change. Please update the PR body Test Plan to either:
    1. Include unit tests + E2E/integration tests (preferred) and check the corresponding boxes, or
    2. Provide a clear, concrete manual testing plan and explain why automated tests are not feasible in this PR (and create a follow-up issue with a timeline for adding automated tests).
  • Add screenshots/screencast demonstrating the new Evaluate tab and key panels.
  • If you want to keep the risk level at Medium, please add the tests above. Otherwise, change label to risk:high and add a short justification for the higher risk (e.g., "no automated tests in this PR").

Please update the PR title/body with the requested Test Plan and screenshots, add the unit/E2E tests (or a clear manual test justification), and then re-submit. Thank you for the thorough feature implementation — once tests and screenshots are added this will be much closer to merge-ready.


Last updated: Tue, 17 Mar 2026 18:17:19 GMT

@github-actions

🤖 AI PR Validation Report

PR Review Results

Thank you for your submission! Here's detailed feedback on your PR title and body compliance:

PR Title

  • Current: feat(designer): Agent evaluations tab
  • Issue: None major — title follows conventional commit style and concisely describes the change.
  • Recommendation: Keep as-is or, if you want more precision, feat(designer): add Agent Evaluations tab and evaluation services to highlight both UI and service changes.

Commit Type

  • Properly selected (feature).
  • Note: Only one option is selected, which is correct.

⚠️ Risk Level

  • The PR body and label indicate: Medium risk (risk:medium).
  • Assessment: Based on the code diff (large feature additions across UI, core state/store, shared services, new API client, and models — ~2357 additions, 25 files changed), I advise a higher risk level: High.
    • Comment: This PR touches core libs (libs/designer-v2, libs/logic-apps-shared, store initialization), registers a new service, and introduces new runtime API calls. These changes can affect app initialization, API contracts, and global state. Please consider using risk:high so reviewers and release managers treat this accordingly.

⚠️ What & Why

  • Current: Add agent evaluations functionality in a new designer tab. Allows users to evaluate A2A/agentic workflow runs using a predefined set of evaluators (tool call trajectory, semantic similarity, custom prompt). All evaluators either use reference runs as ground truth or a separate evaluator model as a judge.
  • Issue: Clear and concise; good.
  • Recommendation: Optionally add a one-line summary of the major implementation changes (UI components, new evaluation service, store/slice additions) to help reviewers map the description to files changed.

Impact of Change

  • Issue: The Impact of Change section is present but minimal (System marked as N/A). Given the scope of changes, the impact is broader than indicated.
  • Recommendation:
    • Users: Introduces new Agent Evaluations tab, new UI panels for runs/evaluators, and the ability to run evaluations from the designer. This is user-facing; consider noting that feature may show for certain workflow kinds (agentic/stateful) only.
    • Developers: Adds new queries/hooks (react-query), a new Redux slice (evaluation), new UI components (EvaluateView and multiple panel components), new models and a StandardEvaluationService that calls backend endpoints. Call out any public surface/API changes in libs/logic-apps-shared since other repos may depend on them.
    • System: API contract additions (new evaluator endpoints and run endpoints) and additional network calls (evaluation run), potential increased telemetry and cost (model runs). Add: "System: adds new backend API usage and model/runtime evaluation calls; monitor API errors and performance."

Test Plan

  • Test Plan Assessment: Missing — the PR has no Unit tests, no E2E tests, and no manual testing notes.
  • Issue: The diff adds new services, complex UI flows, and new redux state but no tests. Per the repository guidance, if no unit or E2E tests are added then the PR must include a clear manual testing plan and justification. This PR currently lists none of the test checkboxes.
  • Recommendation (required before merging):
    • Add unit tests for the new redux slice (evaluationSlice) and selectors.
    • Add unit tests for queries/behaviour in libs/designer-v2/src/lib/core/queries/evaluations.ts (mock EvaluationService and verify query/mutation behavior and cache invalidation).
    • Add unit/component tests for main UI components (EvaluateView, EvaluatorsPanel, EvaluatorFormPanel) including form submission and run flow (mock services). Prefer snapshot or DOM tests for rendering critical flows.
    • If adding automated tests is not possible now, provide a detailed manual test plan explaining how reviewers can exercise the feature (steps to create evaluator, run evaluation, verify result, behavior for stateful vs stateless workflows) and why no automated tests were added. Manual plan should include expected results and error scenarios.
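As a sketch of the mocking approach for the query layer, the following plain-TypeScript example records calls through a fake EvaluationService and checks that a disabled query never reaches the service. The interface, method names, and enabled-flag plumbing are assumptions; a real test would exercise the useEvaluators hook through a react-query test wrapper instead of this simplified fetcher.

```typescript
// Hypothetical service interface; method and type names are assumptions.
interface IEvaluationService {
  getEvaluators(workflowId: string): Promise<{ id: string; name: string }[]>;
}

// Recording mock: captures every call so a test can assert on it.
class MockEvaluationService implements IEvaluationService {
  calls: string[] = [];
  async getEvaluators(workflowId: string) {
    this.calls.push(`getEvaluators:${workflowId}`);
    return [{ id: "ev1", name: "Tool call trajectory" }];
  }
}

// Stand-in for the fetcher a useEvaluators hook would hand to react-query;
// the `enabled` flag models react-query's enabled option.
async function fetchEvaluators(service: IEvaluationService, workflowId: string, enabled: boolean) {
  if (!enabled) return undefined; // a disabled query must not hit the service
  return service.getEvaluators(workflowId);
}

async function main() {
  const mock = new MockEvaluationService();
  const skipped = await fetchEvaluators(mock, "wf1", false);
  if (skipped !== undefined || mock.calls.length !== 0) {
    throw new Error("disabled query called the service");
  }
  const evaluators = await fetchEvaluators(mock, "wf1", true);
  if (evaluators?.length !== 1 || mock.calls[0] !== "getEvaluators:wf1") {
    throw new Error("unexpected service call");
  }
}
main();
```

Mutation tests would follow the same pattern, additionally asserting that the expected query keys are invalidated on success.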

Contributors

  • Contributors Assessment: @andrew-eldridge is listed. Good to credit; if others helped (PM/Design/QA) consider adding them.

⚠️ Screenshots/Videos

  • Screenshots Assessment: Not provided. This is a UI-heavy change — I recommend adding screenshots or a short demo GIF showing the new Evaluate tab, run list, evaluator creation form, and a sample evaluation result. This helps reviewers and designers validate UX quickly.

Summary Table

| Section | Status | Recommendation |
| --- | --- | --- |
| Title | | Keep as-is or slightly expand for clarity |
| Commit Type | | OK |
| Risk Level | ⚠️ | Recommend bump to risk:high and update label |
| What & Why | | Good; optionally mention high-level files changed |
| Impact of Change | | Expand to list system-level impacts and API changes |
| Test Plan | | Add unit/E2E tests or a detailed manual test plan |
| Contributors | | OK; add others if applicable |
| Screenshots/Videos | ⚠️ | Add visual proof for UI changes |

Summary:
This PR introduces a large feature set (new evaluation UI, new redux slice, queries, models, and a new StandardEvaluationService). Because this touches core libraries, the store, service initialization, and adds network/API interactions, I recommend raising the risk to High (please update label) and adding tests or a detailed manual test plan. At present, the PR does NOT pass the PR body checklist because the Test Plan is empty — please add automated tests or a robust manual testing section and address the risk label.

Please update the PR title/body with the following specific items and then re-submit:

  • Risk label: change to risk:high (comment in PR explaining why: touches core libs/store/services/API).
  • Test Plan: either add test files (unit tests for evaluationSlice, queries, EvaluateView components; integration/E2E flow that covers create/run evaluation) OR add a detailed manual testing section with step-by-step instructions and expected results.
  • Impact of Change: expand to describe system/backend/API impacts (new endpoints, potential runtime/cost), and any migration steps (none seen — if none, explicitly state so).
  • Screenshots/Videos: include a screenshot of the Evaluate tab, the create evaluator form, and an evaluation result (or a short demo GIF).

Thank you for the thorough implementation. Once tests/manual test plan and the risk label are addressed, this will be in much better shape for merging.

Helpful file-specific test suggestions:

  • libs/designer-v2/src/lib/core/state/evaluation/evaluationSlice.ts -> unit tests for reducer actions and reset behavior.
  • libs/designer-v2/src/lib/core/queries/evaluations.ts -> mock EvaluationService and test query keys, enabled/disabled logic, and onSuccess invalidations for mutations.
  • libs/logic-apps-shared/src/designer-client-services/lib/standard/evaluation.ts -> unit tests for URL/HTTP calls using a mocked IHttpClient.
  • EvaluateView & panels -> component tests for rendering states (empty, loading, error, result) and form submission flows (EvaluatorFormPanel).
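The third suggestion above can be sketched as follows. The IHttpClient shape, the service class, the base URI, and the api-version value are all placeholders rather than the real logic-apps-shared contract; the point is the recording-mock pattern for asserting which URI the service constructs.

```typescript
// Minimal stand-in for IHttpClient; the real interface lives in logic-apps-shared.
interface IHttpClient {
  get<T>(options: { uri: string }): Promise<T>;
}

// Hypothetical simplified service; the real StandardEvaluationService and its
// endpoint paths will differ. The api-version below is a made-up placeholder.
class EvaluationServiceSketch {
  constructor(private httpClient: IHttpClient, private baseUri: string) {}
  listEvaluators() {
    return this.httpClient.get<unknown[]>({
      uri: `${this.baseUri}/evaluators?api-version=2026-01-01`,
    });
  }
}

// Recording mock httpClient so the test can assert the constructed URI.
const requestedUris: string[] = [];
const mockHttp: IHttpClient = {
  async get<T>({ uri }: { uri: string }): Promise<T> {
    requestedUris.push(uri);
    return [] as unknown as T;
  },
};

async function main() {
  const service = new EvaluationServiceSketch(mockHttp, "https://example.net/workflows/wf1");
  await service.listEvaluators();
  if (requestedUris[0] !== "https://example.net/workflows/wf1/evaluators?api-version=2026-01-01") {
    throw new Error("unexpected URI");
  }
}
main();
```

The same mock can return canned error responses to cover the error-handling paths of each service method.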

Please update and ping reviewers when ready. Thank you!


Last updated: Tue, 17 Mar 2026 17:31:38 GMT


github-actions bot commented Mar 17, 2026

📊 Coverage check completed. See workflow run for details.


Labels

needs-pr-update, risk:medium (Medium risk change with potential impact)
