feat: Add XSTest safety benchmark resource server by dcfarris · Pull Request #764 · NVIDIA-NeMo/Gym

dcfarris · 2026-02-25T20:41:50Z

Summary

Adds XSTest safety benchmark (Rottger et al., NAACL 2024) as a new resource server
450 hand-crafted prompts (250 safe, 200 unsafe) that test whether models correctly comply with safe requests and refuse unsafe ones
Two verification modes: string-match classification (from the paper) and LLM-as-judge with the paper's 3-class taxonomy (full compliance / full refusal / partial refusal)
Judge is optional (use_judge config flag), defaults to string-match, falls back to string-match on judge errors
Includes per-category aggregate reporting script and documentation

What's in this PR

resources_servers/xstest/app.py - resource server with string-match + LLM judge
resources_servers/xstest/configs/xstest.yaml - config with optional judge model server
resources_servers/xstest/prompt_templates/xstest_judge.txt - judge prompt from the paper
resources_servers/xstest/data/example.jsonl - 5 example prompts for smoke testing
resources_servers/xstest/tests/test_app.py - 24 tests across string-match, verdict parsing, and judge integration
resources_servers/xstest/scripts/aggregate_results.py - results aggregation (overall score, unsafe rate, over-refusal rate, per-category breakdown, judge error tracking)
resources_servers/xstest/README.md - input schema, reward scoring tables, usage examples

Validation

Smoke tested end-to-end with ng_collect_rollouts on all 450 prompts
Tested both string-match and LLM-as-judge verification modes
0% judge empty rate, <1% judge error rate in end-to-end runs

Test plan

pytest resources_servers/xstest/tests/test_app.py - 24 tests passing
ruff check and ruff format clean
End-to-end rollout collection on example data and full dataset
Full 450-prompt run with LLM judge, aggregate reporting verified
ng_test +entrypoint=resources_servers/xstest
ng_prepare_data validation
Dataset upload to registry

copy-pr-bot · 2026-02-25T20:41:54Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

Adds XSTest (Rottger et al., NAACL 2024) as a new resource server for evaluating LLM over-refusal behavior. 450 hand-crafted prompts (250 safe, 200 unsafe) test whether models correctly comply with safe requests and refuse unsafe ones. Two verification modes: - String-match classification (from the paper's 21 refusal prefixes) - LLM-as-judge with the paper's 3-class taxonomy (optional, configurable) Judge falls back to string matching on errors. Includes per-category aggregate reporting script and documentation. Signed-off-by: Dave Farris <dfarris@nvidia.com>

- Remove system prompt from dataset (XSTest tests raw prompt behavior) - Document recommended generation params (temp=1.0, max_output_tokens=32768) - Update README with no-system-prompt rationale and gen param guidance - Simplify _parse_judge_verdict (remove unnecessary variable) - Simplify test_sanity and aggregate_results.py Signed-off-by: Dave Farris <dfarris@nvidia.com>

Auto-generated by update-readme-table pre-commit hook. Signed-off-by: Dave Farris <dfarris@nvidia.com>

dcfarris added 3 commits March 4, 2026 15:33

chore: Add xstest to resource server README table

88b4cec

Auto-generated by update-readme-table pre-commit hook. Signed-off-by: Dave Farris <dfarris@nvidia.com>

dcfarris force-pushed the dfarris/add-xstest-benchmark branch from adf194f to 88b4cec Compare March 4, 2026 23:34

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: Add XSTest safety benchmark resource server#764

feat: Add XSTest safety benchmark resource server#764
dcfarris wants to merge 3 commits intomainfrom
dfarris/add-xstest-benchmark

dcfarris commented Feb 25, 2026

Uh oh!

copy-pr-bot bot commented Feb 25, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

dcfarris commented Feb 25, 2026

Summary

What's in this PR

Validation

Test plan

Uh oh!

copy-pr-bot bot commented Feb 25, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant