feat: Add XSTest safety benchmark resource server#764
Draft
Conversation
|
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here. |
Adds XSTest (Rottger et al., NAACL 2024) as a new resource server for evaluating LLM over-refusal behavior. 450 hand-crafted prompts (250 safe, 200 unsafe) test whether models correctly comply with safe requests and refuse unsafe ones. Two verification modes: - String-match classification (from the paper's 21 refusal prefixes) - LLM-as-judge with the paper's 3-class taxonomy (optional, configurable) Judge falls back to string matching on errors. Includes per-category aggregate reporting script and documentation. Signed-off-by: Dave Farris <dfarris@nvidia.com>
- Remove system prompt from dataset (XSTest tests raw prompt behavior) - Document recommended generation params (temp=1.0, max_output_tokens=32768) - Update README with no-system-prompt rationale and gen param guidance - Simplify _parse_judge_verdict (remove unnecessary variable) - Simplify test_sanity and aggregate_results.py Signed-off-by: Dave Farris <dfarris@nvidia.com>
Auto-generated by update-readme-table pre-commit hook. Signed-off-by: Dave Farris <dfarris@nvidia.com>
adf194f to
88b4cec
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
use_judgeconfig flag), defaults to string-match, falls back to string-match on judge errorsWhat's in this PR
resources_servers/xstest/app.py- resource server with string-match + LLM judgeresources_servers/xstest/configs/xstest.yaml- config with optional judge model serverresources_servers/xstest/prompt_templates/xstest_judge.txt- judge prompt from the paperresources_servers/xstest/data/example.jsonl- 5 example prompts for smoke testingresources_servers/xstest/tests/test_app.py- 24 tests across string-match, verdict parsing, and judge integrationresources_servers/xstest/scripts/aggregate_results.py- results aggregation (overall score, unsafe rate, over-refusal rate, per-category breakdown, judge error tracking)resources_servers/xstest/README.md- input schema, reward scoring tables, usage examplesValidation
ng_collect_rolloutson all 450 promptsTest plan
pytest resources_servers/xstest/tests/test_app.py- 24 tests passingruff checkandruff formatcleanng_test +entrypoint=resources_servers/xstestng_prepare_datavalidation