Skip to content

feat: Add XSTest safety benchmark resource server#764

Draft
dcfarris wants to merge 3 commits intomainfrom
dfarris/add-xstest-benchmark
Draft

feat: Add XSTest safety benchmark resource server#764
dcfarris wants to merge 3 commits intomainfrom
dfarris/add-xstest-benchmark

Conversation

@dcfarris
Copy link

Summary

  • Adds XSTest safety benchmark (Rottger et al., NAACL 2024) as a new resource server
  • 450 hand-crafted prompts (250 safe, 200 unsafe) that test whether models correctly comply with safe requests and refuse unsafe ones
  • Two verification modes: string-match classification (from the paper) and LLM-as-judge with the paper's 3-class taxonomy (full compliance / full refusal / partial refusal)
  • Judge is optional (use_judge config flag), defaults to string-match, falls back to string-match on judge errors
  • Includes per-category aggregate reporting script and documentation

What's in this PR

  • resources_servers/xstest/app.py - resource server with string-match + LLM judge
  • resources_servers/xstest/configs/xstest.yaml - config with optional judge model server
  • resources_servers/xstest/prompt_templates/xstest_judge.txt - judge prompt from the paper
  • resources_servers/xstest/data/example.jsonl - 5 example prompts for smoke testing
  • resources_servers/xstest/tests/test_app.py - 24 tests across string-match, verdict parsing, and judge integration
  • resources_servers/xstest/scripts/aggregate_results.py - results aggregation (overall score, unsafe rate, over-refusal rate, per-category breakdown, judge error tracking)
  • resources_servers/xstest/README.md - input schema, reward scoring tables, usage examples

Validation

  • Smoke tested end-to-end with ng_collect_rollouts on all 450 prompts
  • Tested both string-match and LLM-as-judge verification modes
  • 0% judge empty rate, <1% judge error rate in end-to-end runs

Test plan

  • pytest resources_servers/xstest/tests/test_app.py - 24 tests passing
  • ruff check and ruff format clean
  • End-to-end rollout collection on example data and full dataset
  • Full 450-prompt run with LLM judge, aggregate reporting verified
  • ng_test +entrypoint=resources_servers/xstest
  • ng_prepare_data validation
  • Dataset upload to registry

@copy-pr-bot
Copy link

copy-pr-bot bot commented Feb 25, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

dcfarris added 3 commits March 4, 2026 15:33
Adds XSTest (Rottger et al., NAACL 2024) as a new resource server for
evaluating LLM over-refusal behavior. 450 hand-crafted prompts (250 safe,
200 unsafe) test whether models correctly comply with safe requests and
refuse unsafe ones.

Two verification modes:
- String-match classification (from the paper's 21 refusal prefixes)
- LLM-as-judge with the paper's 3-class taxonomy (optional, configurable)

Judge falls back to string matching on errors. Includes per-category
aggregate reporting script and documentation.

Signed-off-by: Dave Farris <dfarris@nvidia.com>
- Remove system prompt from dataset (XSTest tests raw prompt behavior)
- Document recommended generation params (temp=1.0, max_output_tokens=32768)
- Update README with no-system-prompt rationale and gen param guidance
- Simplify _parse_judge_verdict (remove unnecessary variable)
- Simplify test_sanity and aggregate_results.py

Signed-off-by: Dave Farris <dfarris@nvidia.com>
Auto-generated by update-readme-table pre-commit hook.

Signed-off-by: Dave Farris <dfarris@nvidia.com>
@dcfarris dcfarris force-pushed the dfarris/add-xstest-benchmark branch from adf194f to 88b4cec Compare March 4, 2026 23:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant