langwatch x nebius: builder hour by Aryansharma28 · Pull Request #4 · langwatch/bank-example

Aryansharma28 · 2026-02-13T13:59:28Z

Summary

This PR adds comprehensive, realistic test scenarios for evaluating bank customer support agents across multiple LLM models from Nebius Token Factory.

What's in here?

5 Model Variants: DeepSeek, GLM, MiniMax, OpenAI (GPT-oss), and Claude Sonnet 4.5
4 Real-World Scenarios:
1. 🚨 Fraud Investigation - Unauthorized transactions and account security
2. 📞 Escalation Workflow - Customer demands to speak with manager
3. 🔧 Complex Multi-Issue - Multiple interconnected banking problems
4. ⚡ Urgent Business - Frozen account affecting payroll
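As a rough illustration of the coverage listed above, the model variants and scenarios can be crossed into a test matrix. This is a minimal sketch, not code from the PR: the identifiers `MODELS`, `SCENARIOS`, and `build_test_matrix` are hypothetical, and only the labels come from the description.

```python
from itertools import product

# Labels copied from the PR description; the variable names are hypothetical.
MODELS = ["DeepSeek", "GLM", "MiniMax", "OpenAI (GPT-oss)", "Claude Sonnet 4.5"]
SCENARIOS = [
    "fraud_investigation",
    "escalation_workflow",
    "complex_multi_issue",
    "urgent_business",
]


def build_test_matrix(models, scenarios):
    """Return one (model, scenario) pair per test case."""
    return list(product(models, scenarios))


matrix = build_test_matrix(MODELS, SCENARIOS)
print(len(matrix))  # 5 models x 4 scenarios = 20 cases
```

Each pair would correspond to one simulated conversation; how the repository actually organizes its test files may differ.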

- Add 4 realistic test scenarios across all models (DeepSeek, GLM, MiniMax, OpenAI, Claude)
- Scenarios: fraud investigation, escalation, complex multi-issue, urgent business
- Improved test design with realistic conversation flows and customer ID (CUST_001)
- Removed prescriptive tool assertions in favor of outcome-based evaluation
- Updated judge criteria to focus on real-world customer service quality (5 criteria per test)
- All judges now use GPT-4o for consistent evaluation
- Added Claude Sonnet 4.5 model for comparison
- Test results: Claude Sonnet 4.5 and GLM achieved 100% pass rate
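To make the idea of outcome-based evaluation concrete, here is a minimal sketch of how per-criterion judge verdicts could be aggregated into a pass rate. It assumes a test passes only when every criterion is met, which the PR does not state explicitly. The `Verdict` class, the `pass_rate` helper, and the criterion names are all hypothetical, and the sample data is a placeholder, not a result from this PR.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Verdict:
    """Hypothetical record of one judged conversation."""
    scenario: str
    criteria_met: Dict[str, bool]  # criterion name -> whether the judge accepted it

    @property
    def passed(self) -> bool:
        # Assumption: a test passes only if every criterion is satisfied.
        return all(self.criteria_met.values())


def pass_rate(verdicts: List[Verdict]) -> float:
    """Fraction of verdicts that passed; 0.0 for an empty list."""
    if not verdicts:
        return 0.0
    return sum(v.passed for v in verdicts) / len(verdicts)


# Placeholder data for illustration only.
sample = [
    Verdict("fraud_investigation", {"verifies_identity": True, "secures_account": True}),
    Verdict("escalation_workflow", {"acknowledges_request": True, "offers_handoff": False}),
]
print(pass_rate(sample))  # 0.5
```

The point of this shape is that the check looks at what the conversation achieved rather than at which tools the agent called along the way.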

- Update all test files to use Claude Sonnet 4.5 as User Simulator
- Update all test files to use Claude Sonnet 4.5 as Judge
- Change test_demo_openai.py agent model from GPT-4o to openai/gpt-oss-120b (OSS)
- Ensures consistent evaluation across all open-source agent models
- All agent models are now open-source except the Claude benchmark file

Agent models tested:
- DeepSeek-V3.2 (Nebius OSS)
- GLM-4.7-FP8 (Nebius OSS)
- MiniMax-M2.1 (Nebius OSS)
- openai/gpt-oss-120b (OSS)
- Claude Sonnet 4.5 (benchmark)
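The role assignment described in this commit can be summarized as plain data. The sketch below is illustrative only: the dictionary layout and the `open_source_agents` helper are hypothetical, and the strings are display labels from the commit message rather than verified API model identifiers.

```python
# Fixed roles shared by every test file, per the commit message.
ROLES = {
    "user_simulator": "Claude Sonnet 4.5",
    "judge": "Claude Sonnet 4.5",
}

# Agent models under test, tagged with the label used in the commit message.
AGENT_MODELS = {
    "DeepSeek-V3.2": "Nebius OSS",
    "GLM-4.7-FP8": "Nebius OSS",
    "MiniMax-M2.1": "Nebius OSS",
    "openai/gpt-oss-120b": "OSS",
    "Claude Sonnet 4.5": "benchmark",
}


def open_source_agents(agent_models):
    """Return the agent models whose tag marks them as open-source."""
    return [name for name, tag in agent_models.items() if "OSS" in tag]


print(open_source_agents(AGENT_MODELS))
```

Holding the user simulator and judge constant while varying only the agent model is what makes the per-model results comparable.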

rogeriochaves and others added 4 commits December 6, 2025 09:30

migrate from openai to nebius

0b9a637

nebius demo

3b8c1f7

fix: tests

f2e97c8

Aryansharma28 changed the title ~~Add comprehensive realistic test scenarios for agent evaluation~~ langwatch x nebius: builder hour Feb 13, 2026


Aryansharma28 wants to merge 5 commits into main from nebius-demo


2 participants
