langwatch x nebius: builder hour#4

Open
Aryansharma28 wants to merge 5 commits into main from nebius-demo

Aryansharma28 commented Feb 13, 2026

Summary

This PR adds realistic test scenarios for evaluating bank customer support agents across multiple LLMs from Nebius Token Factory.

What's in here?

  • 5 Model Variants: DeepSeek, GLM, MiniMax, OpenAI (gpt-oss), and Claude Sonnet 4.5
  • 4 Real-World Scenarios:
    1. 🚨 Fraud Investigation - Unauthorized transactions and account security
    2. 📞 Escalation Workflow - Customer demands to speak with a manager
    3. 🔧 Complex Multi-Issue - Multiple interconnected banking problems
    4. ⚡ Urgent Business - Frozen account affecting payroll
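
The model and scenario lists above imply a 5 x 4 test matrix. A minimal sketch of how that matrix could be enumerated follows; it is not the code in this PR. The names `EvalCase`, `build_matrix`, and the scenario keys are hypothetical, and the model strings are display labels from the description, not API identifiers.

```python
from dataclasses import dataclass
from itertools import product

# Display labels from the PR description (assumption: not real API model ids).
MODELS = ["DeepSeek", "GLM", "MiniMax", "OpenAI (gpt-oss)", "Claude Sonnet 4.5"]

# Hypothetical scenario keys mapped to the descriptions listed in the PR.
SCENARIOS = {
    "fraud_investigation": "Unauthorized transactions and account security",
    "escalation_workflow": "Customer demands to speak with a manager",
    "complex_multi_issue": "Multiple interconnected banking problems",
    "urgent_business": "Frozen account affecting payroll",
}


@dataclass(frozen=True)
class EvalCase:
    """One (model, scenario) pairing to evaluate."""
    model: str
    scenario: str
    description: str


def build_matrix():
    """Return every model/scenario combination as an EvalCase."""
    return [
        EvalCase(model, key, desc)
        for model, (key, desc) in product(MODELS, SCENARIOS.items())
    ]


if __name__ == "__main__":
    for case in build_matrix():
        print(f"{case.model:>18} | {case.scenario}")
```

Keeping the matrix as plain data makes it straightforward to feed into a parametrized test runner, though the PR itself appears to use one test file per model.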

rogeriochaves and others added 4 commits December 6, 2025 09:30
- Add 4 realistic test scenarios across all models (DeepSeek, GLM, MiniMax, OpenAI, Claude)
- Scenarios: fraud investigation, escalation, complex multi-issue, urgent business
- Improved test design with realistic conversation flows and customer ID (CUST_001)
- Removed prescriptive tool assertions in favor of outcome-based evaluation
- Updated judge criteria to focus on real-world customer service quality (5 criteria per test)
- All judges now use GPT-4o for consistent evaluation
- Added Claude Sonnet 4.5 model for comparison
- Test results: Claude Sonnet 4.5 and GLM achieved 100% pass rate
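
As a rough illustration of the outcome-based evaluation described in this commit, the sketch below computes a pass rate from per-criterion judge verdicts. It assumes a scenario passes only when every one of its judge criteria is met; the commit message does not state that rule, and the function names and sample verdicts are hypothetical, not results from this PR.

```python
def scenario_passed(verdicts):
    """Assumption: a scenario passes only if all judge criteria are met."""
    return len(verdicts) > 0 and all(verdicts)


def pass_rate(results):
    """Fraction of scenarios that passed.

    `results` maps a scenario name to a list of booleans,
    one per judge criterion (5 per test in this PR).
    """
    if not results:
        return 0.0
    passed = sum(1 for verdicts in results.values() if scenario_passed(verdicts))
    return passed / len(results)


if __name__ == "__main__":
    # Made-up verdicts for illustration only; not data from the PR.
    example = {
        "scenario_a": [True, True, True, True, True],
        "scenario_b": [True, True, False, True, True],
    }
    print(f"pass rate: {pass_rate(example):.0%}")
```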
Aryansharma28 changed the title from "Add comprehensive realistic test scenarios for agent evaluation" to "langwatch x nebius: builder hour" on Feb 13, 2026
- Update all test files to use Claude Sonnet 4.5 as User Simulator
- Update all test files to use Claude Sonnet 4.5 as Judge
- Change test_demo_openai.py agent model from GPT-4o to openai/gpt-oss-120b (OSS)
- Ensures consistent evaluation across all open-source agent models
- All agent models are now open-source except in the Claude benchmark file

Agent models tested:
- DeepSeek-V3.2 (Nebius OSS)
- GLM-4.7-FP8 (Nebius OSS)
- MiniMax-M2.1 (Nebius OSS)
- openai/gpt-oss-120b (OSS)
- Claude Sonnet 4.5 (benchmark)
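
The role split this commit describes (a fixed User Simulator and Judge, with only the agent model varying) could be captured in a small config table. This is a sketch under assumptions: `RunConfig` and `build_configs` are hypothetical names, and the strings are the labels from the commit message, not verified provider model ids.

```python
from dataclasses import dataclass

# Per the commit message, the same model plays both supporting roles.
USER_SIMULATOR_MODEL = "Claude Sonnet 4.5"
JUDGE_MODEL = "Claude Sonnet 4.5"

# Agent model label -> whether it is the closed-source benchmark.
AGENT_MODELS = {
    "DeepSeek-V3.2": False,
    "GLM-4.7-FP8": False,
    "MiniMax-M2.1": False,
    "openai/gpt-oss-120b": False,
    "Claude Sonnet 4.5": True,
}


@dataclass(frozen=True)
class RunConfig:
    """Models assigned to each role for one test file (hypothetical type)."""
    agent_model: str
    user_simulator_model: str
    judge_model: str
    is_benchmark: bool


def build_configs():
    """Pair each agent model with the shared simulator and judge."""
    return [
        RunConfig(agent, USER_SIMULATOR_MODEL, JUDGE_MODEL, is_benchmark)
        for agent, is_benchmark in AGENT_MODELS.items()
    ]


if __name__ == "__main__":
    for cfg in build_configs():
        tag = "benchmark" if cfg.is_benchmark else "open-source"
        print(f"{cfg.agent_model} ({tag}), judged by {cfg.judge_model}")
```

Holding the simulator and judge constant means differences in pass rate can be attributed to the agent model rather than to the evaluator.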