
feat(designer): Agent evaluations tab#8932

Draft
andrew-eldridge wants to merge 3 commits into main from aeldridge/agentEval

Conversation


andrew-eldridge (Contributor) commented Mar 17, 2026

Commit Type

  • [x] feature - New functionality
  • [ ] fix - Bug fix
  • [ ] refactor - Code restructuring without behavior change
  • [ ] perf - Performance improvement
  • [ ] docs - Documentation update
  • [ ] test - Test-related changes
  • [ ] chore - Maintenance/tooling

Risk Level

  • [ ] Low - Minor changes, limited scope
  • [x] Medium - Moderate changes, some user impact
  • [ ] High - Major changes, significant user/system impact

What & Why

Add agent evaluations functionality in a new designer tab. Allows users to evaluate A2A/agentic workflow runs using a predefined set of evaluators (tool call trajectory, semantic similarity, custom prompt). All evaluators either use reference runs as ground truth or a separate evaluator model as a judge.

Impact of Change

  • Users: Introduces new Agent Evaluations tab, new UI panels for runs/evaluators, and the ability to run evaluations from the designer. Only applies to A2A/agentic workflows
  • Developers: Adds new queries/hooks (react-query), a new Redux slice (evaluation), new UI components (EvaluateView and multiple panel components), new models and a StandardEvaluationService
  • System: API contract additions (new evaluator endpoints and run endpoints) and additional network calls for eval fetching/management and running evals

Test Plan

  • [ ] Unit tests added/updated
  • [ ] E2E tests added/updated
  • [ ] Manual testing completed
  • Tested in:

Contributors

@andrew-eldridge

Screenshots/Videos

andrew-eldridge added the risk:medium (Medium risk change with potential impact) label on Mar 17, 2026

github-actions bot commented Mar 17, 2026

🤖 AI PR Validation Report

PR Review Results

Thank you for your submission! Here's detailed feedback on your PR title and body compliance:

PR Title

  • Current: feat(designer): Agent evaluations tab
  • Issue: None — title is concise and follows conventional commit style. Consider adding a short scope or note if you want to call out a backend dependency (optional).
  • Recommendation: Keep as-is or optionally: feat(designer): Add Agent Evaluations tab and evaluation management UI

Commit Type

  • Properly selected (feature).
  • Note: Only one commit type is selected, which is correct for this change.

Risk Level

  • The PR declares Medium and a risk:medium label is present.
  • Assessment: The scale of this change (new UI surface, new Redux slice, new client-service APIs, many new network calls, and new models) reasonably maps to medium risk. However, because there are no unit/E2E tests included (see Test Plan below), I recommend either adding tests or temporarily increasing the risk level to High until testing is present.

What & Why

  • Current: "Add agent evaluations functionality in a new designer tab..." (present)
  • Issue: Acceptable and concise. Good description of feature & purpose.
  • Recommendation: No mandatory change; you may add one short line calling out backend contract changes (API endpoints) and whether a backend rollout is required.

Impact of Change

  • Impact section is present and explains Users/Developers/System impact.
  • Recommendation: Good. Consider calling out any migration steps or configuration toggles if the feature is behind a feature flag.

Test Plan

  • Assessment: Missing. The body has all test checkboxes unchecked and no explanation.

  • Issue: This PR adds significant functionality (new services, API calls, Redux state, and UI). Per repository guidance, if no unit tests or E2E tests are added there must be a clear manual testing justification and instructions. That is missing.

  • Recommendation (required to pass):

    • Add unit tests covering at minimum:
      • evaluationSlice reducer (actions, initial state, reducers like setSelectedEvaluator, setEvaluationResult, resetEvaluationState)
      • core queries (useEvaluators, useEvaluations) mocking EvaluationService to ensure caching/enable flags behave correctly
      • StandardEvaluationService: unit tests for constructed URIs and correct httpClient calls (mock httpClient)
      • Key UI components smoke tests: EvaluateView renders, EvaluatorManagementPanel lists evaluators, RunDatasetPanel lists runs and selects run/action
    • Add E2E tests or an integration test that exercises creating an evaluator, selecting a run, and running an evaluation (or provide a written, explicit manual test plan with steps, expected results, and environments if E2E not yet feasible).
    • Update the Test Plan section in the PR body to reflect the added tests (check boxes) and list the test environments.

    If for some reason tests cannot be added in this PR, explain why in the Test Plan and provide a follow-up ticket with timelines. Without tests or a clear justification, this PR should not be merged.
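To make the reducer recommendation above concrete, here is a minimal unit-test sketch in plain TypeScript. The state shape, action type strings, and payloads are assumptions for illustration only; the real evaluationSlice is presumably generated with Redux Toolkit's createSlice, so an actual test would import the generated actions and reducer rather than this stand-in.

```typescript
// Hypothetical shape of the evaluation slice state; field names are assumptions.
interface EvaluationState {
  selectedEvaluatorId: string | null;
  result: unknown | null;
}

const initialState: EvaluationState = { selectedEvaluatorId: null, result: null };

type EvaluationAction =
  | { type: "evaluation/setSelectedEvaluator"; payload: string }
  | { type: "evaluation/setEvaluationResult"; payload: unknown }
  | { type: "evaluation/resetEvaluationState" };

// Plain reducer standing in for the createSlice-generated one.
function evaluationReducer(
  state: EvaluationState = initialState,
  action: EvaluationAction
): EvaluationState {
  switch (action.type) {
    case "evaluation/setSelectedEvaluator":
      return { ...state, selectedEvaluatorId: action.payload };
    case "evaluation/setEvaluationResult":
      return { ...state, result: action.payload };
    case "evaluation/resetEvaluationState":
      return initialState;
    default:
      return state;
  }
}

// Test: selecting an evaluator then resetting restores the initial state.
let state = evaluationReducer(undefined, {
  type: "evaluation/setSelectedEvaluator",
  payload: "semantic-similarity",
});
if (state.selectedEvaluatorId !== "semantic-similarity") throw new Error("evaluator not selected");
state = evaluationReducer(state, { type: "evaluation/resetEvaluationState" });
if (state.selectedEvaluatorId !== null || state.result !== null) throw new Error("reset failed");
```

The same select-then-reset pattern extends naturally to setEvaluationResult and any other actions the slice exposes.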


⚠️ Contributors

  • Assessment: Contributors listed (@andrew-eldridge). Good.
  • Recommendation: If PMs/designers/reviewers contributed, please tag them here as a courtesy (optional).

⚠️ Screenshots/Videos

  • Assessment: None provided. This PR adds a major UI tab and many panels.
  • Recommendation: Please add screenshots or a short screencast of the new Evaluate tab UI (Runs, Evaluators, Right panel, Result panel). This helps reviewers catch visual regressions.

Summary Table

| Section | Status | Recommendation |
| --- | --- | --- |
| Title | | Keep as-is |
| Commit Type | | OK |
| Risk Level | ✅ (declared: Medium) | Consider raising to High or add tests to keep Medium |
| What & Why | | OK; optionally add backend rollout note |
| Impact of Change | | OK; mention feature flag or migration if any |
| Test Plan | | Add unit/E2E tests or a justified manual test plan |
| Contributors | | Add other contributors if applicable |
| Screenshots/Videos | ⚠️ | Add screenshots/screencast of UI changes |

Final notes & action items

  • This PR cannot pass the PR body/title compliance check because the Test Plan is empty for a significant functional change. Please update the PR body Test Plan to either:
    1. Include unit tests + E2E/integration tests (preferred) and check the corresponding boxes, or
    2. Provide a clear, concrete manual testing plan and explain why automated tests are not feasible in this PR (and create a follow-up issue with a timeline for adding automated tests).
  • Add screenshots/screencast demonstrating the new Evaluate tab and key panels.
  • If you want to keep the risk level at Medium, please add the tests above. Otherwise, change label to risk:high and add a short justification for the higher risk (e.g., "no automated tests in this PR").

Please update the PR title/body with the requested Test Plan and screenshots, add the unit/E2E tests (or a clear manual test justification), and then re-submit. Thank you for the thorough feature implementation — once tests and screenshots are added this will be much closer to merge-ready.


Last updated: Tue, 17 Mar 2026 18:17:19 GMT

@github-actions

🤖 AI PR Validation Report

PR Review Results

Thank you for your submission! Here's detailed feedback on your PR title and body compliance:

PR Title

  • Current: feat(designer): Agent evaluations tab
  • Issue: None major — title follows conventional commit style and concisely describes the change.
  • Recommendation: Keep as-is or, if you want more precision, feat(designer): add Agent Evaluations tab and evaluation services to highlight both UI and service changes.

Commit Type

  • Properly selected (feature).
  • Note: Only one option is selected, which is correct.

⚠️ Risk Level

  • The PR body and label indicate: Medium risk (risk:medium).
  • Assessment: Based on the code diff (large feature additions across UI, core state/store, shared services, new API client, and models — ~2357 additions, 25 files changed), I advise a higher risk level: High.
    • Comment: This PR touches core libs (libs/designer-v2, libs/logic-apps-shared, store initialization), registers a new service, and introduces new runtime API calls. These changes can affect app initialization, API contracts, and global state. Please consider using risk:high so reviewers and release managers treat this accordingly.

⚠️ What & Why

  • Current: Add agent evaluations functionality in a new designer tab. Allows users to evaluate A2A/agentic workflow runs using a predefined set of evaluators (tool call trajectory, semantic similarity, custom prompt). All evaluators either use reference runs as ground truth or a separate evaluator model as a judge.
  • Issue: Clear and concise; good.
  • Recommendation: Optionally add a one-line summary of the major implementation changes (UI components, new evaluation service, store/slice additions) to help reviewers map the description to files changed.

Impact of Change

  • Issue: The Impact of Change section is present but minimal (System marked as N/A). Given the scope of changes, the impact is broader than indicated.
  • Recommendation:
    • Users: Introduces new Agent Evaluations tab, new UI panels for runs/evaluators, and the ability to run evaluations from the designer. This is user-facing; consider noting that feature may show for certain workflow kinds (agentic/stateful) only.
    • Developers: Adds new queries/hooks (react-query), a new Redux slice (evaluation), new UI components (EvaluateView and multiple panel components), new models and a StandardEvaluationService that calls backend endpoints. Call out any public surface/API changes in libs/logic-apps-shared since other repos may depend on them.
    • System: API contract additions (new evaluator endpoints and run endpoints) and additional network calls (evaluation run), potential increased telemetry and cost (model runs). Add: "System: adds new backend API usage and model/runtime evaluation calls; monitor API errors and performance."

Test Plan

  • Test Plan Assessment: Missing — the PR has no Unit tests, no E2E tests, and no manual testing notes.
  • Issue: The diff adds new services, complex UI flows, and new redux state but no tests. Per the repository guidance, if no unit or E2E tests are added then the PR must include a clear manual testing plan and justification. This PR currently lists none of the test checkboxes.
  • Recommendation (required before merging):
    • Add unit tests for the new redux slice (evaluationSlice) and selectors.
    • Add unit tests for queries/behaviour in libs/designer-v2/src/lib/core/queries/evaluations.ts (mock EvaluationService and verify query/mutation behavior and cache invalidation).
    • Add unit/component tests for main UI components (EvaluateView, EvaluatorsPanel, EvaluatorFormPanel) including form submission and run flow (mock services). Prefer snapshot or DOM tests for rendering critical flows.
    • If adding automated tests is not possible now, provide a detailed manual test plan explaining how reviewers can exercise the feature (steps to create evaluator, run evaluation, verify result, behavior for stateful vs stateless workflows) and why no automated tests were added. Manual plan should include expected results and error scenarios.
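As a sketch of the mocking approach for the query layer, the following plain-TypeScript example records calls through a fake EvaluationService and checks that a disabled query never reaches the service. The interface, method names, and enabled-flag plumbing are assumptions; a real test would exercise the useEvaluators hook through a react-query test wrapper instead of this simplified fetcher.

```typescript
// Hypothetical service interface; method and type names are assumptions.
interface IEvaluationService {
  getEvaluators(workflowId: string): Promise<{ id: string; name: string }[]>;
}

// Recording mock: captures every call so a test can assert on it.
class MockEvaluationService implements IEvaluationService {
  calls: string[] = [];
  async getEvaluators(workflowId: string) {
    this.calls.push(`getEvaluators:${workflowId}`);
    return [{ id: "ev1", name: "Tool call trajectory" }];
  }
}

// Stand-in for the fetcher a useEvaluators hook would hand to react-query;
// the `enabled` flag models react-query's enabled option.
async function fetchEvaluators(service: IEvaluationService, workflowId: string, enabled: boolean) {
  if (!enabled) return undefined; // a disabled query must not hit the service
  return service.getEvaluators(workflowId);
}

async function main() {
  const mock = new MockEvaluationService();
  const skipped = await fetchEvaluators(mock, "wf1", false);
  if (skipped !== undefined || mock.calls.length !== 0) {
    throw new Error("disabled query called the service");
  }
  const evaluators = await fetchEvaluators(mock, "wf1", true);
  if (evaluators?.length !== 1 || mock.calls[0] !== "getEvaluators:wf1") {
    throw new Error("unexpected service call");
  }
}
main();
```

Mutation tests would follow the same pattern, additionally asserting that the expected query keys are invalidated on success.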

Contributors

  • Contributors Assessment: @andrew-eldridge is listed. Good to credit; if others helped (PM/Design/QA) consider adding them.

⚠️ Screenshots/Videos

  • Screenshots Assessment: Not provided. This is a UI-heavy change — I recommend adding screenshots or a short demo GIF showing the new Evaluate tab, run list, evaluator creation form, and a sample evaluation result. This helps reviewers and designers validate UX quickly.

Summary Table

| Section | Status | Recommendation |
| --- | --- | --- |
| Title | | Keep as-is or slightly expand for clarity |
| Commit Type | | OK |
| Risk Level | ⚠️ | Recommend bump to risk:high and update label |
| What & Why | | Good; optionally mention high-level files changed |
| Impact of Change | | Expand to list system-level impacts and API changes |
| Test Plan | | Add unit/E2E tests or a detailed manual test plan |
| Contributors | | OK; add others if applicable |
| Screenshots/Videos | ⚠️ | Add visual proof for UI changes |

Summary:
This PR introduces a large feature set (new evaluation UI, new redux slice, queries, models, and a new StandardEvaluationService). Because this touches core libraries, the store, service initialization, and adds network/API interactions, I recommend raising the risk to High (please update label) and adding tests or a detailed manual test plan. At present, the PR does NOT pass the PR body checklist because the Test Plan is empty — please add automated tests or a robust manual testing section and address the risk label.

Please update the PR title/body with the following specific items and then re-submit:

  • Risk label: change to risk:high (comment in PR explaining why: touches core libs/store/services/API).
  • Test Plan: either add test files (unit tests for evaluationSlice, queries, EvaluateView components; integration/E2E flow that covers create/run evaluation) OR add a detailed manual testing section with step-by-step instructions and expected results.
  • Impact of Change: expand to describe system/backend/API impacts (new endpoints, potential runtime/cost), and any migration steps (none seen — if none, explicitly state so).
  • Screenshots/Videos: include a screenshot of the Evaluate tab, the create evaluator form, and an evaluation result (or a short demo GIF).

Thank you for the thorough implementation. Once tests/manual test plan and the risk label are addressed, this will be in much better shape for merging.

Helpful file-specific test suggestions:

  • libs/designer-v2/src/lib/core/state/evaluation/evaluationSlice.ts -> unit tests for reducer actions and reset behavior.
  • libs/designer-v2/src/lib/core/queries/evaluations.ts -> mock EvaluationService and test query keys, enabled/disabled logic, and onSuccess invalidations for mutations.
  • libs/logic-apps-shared/src/designer-client-services/lib/standard/evaluation.ts -> unit tests for URL/HTTP calls using a mocked IHttpClient.
  • EvaluateView & panels -> component tests for rendering states (empty, loading, error, result) and form submission flows (EvaluatorFormPanel).
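The third suggestion above can be sketched as follows. The IHttpClient shape, the service class, the base URI, and the api-version value are all placeholders rather than the real logic-apps-shared contract; the point is the recording-mock pattern for asserting which URI the service constructs.

```typescript
// Minimal stand-in for IHttpClient; the real interface lives in logic-apps-shared.
interface IHttpClient {
  get<T>(options: { uri: string }): Promise<T>;
}

// Hypothetical simplified service; the real StandardEvaluationService and its
// endpoint paths will differ. The api-version below is a made-up placeholder.
class EvaluationServiceSketch {
  constructor(private httpClient: IHttpClient, private baseUri: string) {}
  listEvaluators() {
    return this.httpClient.get<unknown[]>({
      uri: `${this.baseUri}/evaluators?api-version=2026-01-01`,
    });
  }
}

// Recording mock httpClient so the test can assert the constructed URI.
const requestedUris: string[] = [];
const mockHttp: IHttpClient = {
  async get<T>({ uri }: { uri: string }): Promise<T> {
    requestedUris.push(uri);
    return [] as unknown as T;
  },
};

async function main() {
  const service = new EvaluationServiceSketch(mockHttp, "https://example.net/workflows/wf1");
  await service.listEvaluators();
  if (requestedUris[0] !== "https://example.net/workflows/wf1/evaluators?api-version=2026-01-01") {
    throw new Error("unexpected URI");
  }
}
main();
```

The same mock can return canned error responses to cover the error-handling paths of each service method.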

Please update and ping reviewers when ready. Thank you!


Last updated: Tue, 17 Mar 2026 17:31:38 GMT


github-actions bot commented Mar 17, 2026

📊 Coverage check completed. See workflow run for details.


Labels

needs-pr-update, risk:medium (Medium risk change with potential impact)
