Feature/nvidia if bench validators integrations #801

Open
abubakaria56 wants to merge 37 commits into NVIDIA-NeMo:main from abubakaria56:feature/nvidia-IF-bench-validators-integrations

abubakaria56 commented Mar 2, 2026

Turing VIF

A validation service in the NeMo Gym pipeline that answers: Did the model’s reply follow all the instructions? It does not generate text; it only grades one reply against a list of instructions and returns pass/fail and a reward (0 or 1) for training.

Two Kinds of Validators

  • Fast (rule-based): Counts and patterns (word count, start/end phrase, keywords, forbidden words, lists, tables, case, punctuation, etc.). Runs on the server; no external API.
  • Judge (LLM-based): Meaning and style (tone, formality, politeness, role, audience, custom yes/no questions). The server calls a separate “judge” LLM; those calls run in parallel.
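
To make the "fast" category concrete, here is a minimal sketch of two rule-based checks. The function names, signatures, and messages are hypothetical and are not taken from the PR; the regex word split only suits space-delimited languages, whereas the PR describes language-specific rules for the others.

```python
import re
from typing import List, Tuple


def validate_max_words(reply: str, max_words: int) -> Tuple[bool, str]:
    """Hypothetical fast validator: pass if the reply has at most max_words words."""
    # Assumption: a "word" is a run of word characters between word boundaries.
    count = len(re.findall(r"\b\w+\b", reply))
    if count <= max_words:
        return True, f"Word count {count} is within the limit of {max_words}."
    return False, f"Word count {count} exceeds the limit of {max_words}."


def validate_forbidden_words(reply: str, forbidden: List[str]) -> Tuple[bool, str]:
    """Hypothetical fast validator: fail if any forbidden word appears as a whole word."""
    lowered = reply.lower()
    hits = [
        word
        for word in forbidden
        if re.search(rf"\b{re.escape(word.lower())}\b", lowered)
    ]
    if hits:
        return False, f"Forbidden words found: {', '.join(hits)}."
    return True, "No forbidden words found."
```

Both checks return a pass/fail flag plus a short message, matching the shape the description gives for fast validators.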

Multi-language support

Turing VIF supports 11 languages: English (en), Spanish (es), French (fre), German (de), Italian (it), Brazilian Portuguese (pt-BR), Japanese (ja), Chinese (zh), Korean (ko), Hindi (hi), and Arabic (ar). The language is set per request.

  • Fast validators use language-specific rules (word boundaries, sentence splitting, vowels, case).
  • Case-related instructions are disabled where they don’t apply (e.g. “all caps” for Chinese/Japanese).
  • If any instruction is not supported for the requested language, the request fails early and the reply is not validated.
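
The early-failure rule above could look roughly like the sketch below. The instruction IDs, the language subset, and the lookup table are illustrative assumptions; the PR does not show how the real compatibility data is stored.

```python
from typing import Dict, FrozenSet, List

# Assumption: a small illustrative subset of languages, not the full supported list.
SUPPORTED_LANGUAGES: FrozenSet[str] = frozenset({"en", "ja", "zh"})

# Assumption: hypothetical instruction IDs that make no sense for a given language.
UNSUPPORTED_BY_LANGUAGE: Dict[str, FrozenSet[str]] = {
    "zh": frozenset({"change_case:all_caps"}),
    "ja": frozenset({"change_case:all_caps"}),
}


def check_language_compatibility(language: str, instruction_ids: List[str]) -> List[str]:
    """Return a list of error messages; an empty list means the request may proceed."""
    if language not in SUPPORTED_LANGUAGES:
        return [f"Unsupported language: {language}"]
    blocked = UNSUPPORTED_BY_LANGUAGE.get(language, frozenset())
    return [
        f"Instruction {instruction_id} is not supported for language {language}"
        for instruction_id in instruction_ids
        if instruction_id in blocked
    ]
```

If the returned list is non-empty, the caller would fail the request before running any validator.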

What Happens When It Runs

  1. Request arrives with the model reply, list of instructions, optional custom judge questions, and language.
  2. Basic checks: Request format and language support. If the request is malformed, or the language or any instruction is unsupported, return failure without validating.
  3. Instructions are split into fast vs judge.
  4. Fast validators run sequentially; each returns Pass/Fail + short message.
  5. Judge validators (and custom questions) run in parallel; judge returns YES/NO → Pass/Fail.
  6. Misalignment: For instructions marked “misalignment,” if the model did follow them, the result is flipped to Fail (we want the model not to follow those).
  7. Result: One reward (1.0 if all passed, 0.0 otherwise), a per-instruction pass/fail list, and validation messages. Training uses this reward and list.
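
Steps 3 to 7 can be sketched as follows. This is an illustration under stated assumptions, not the service's code: the Instruction dataclass, the grade function, and the check registries are hypothetical, and the sketch assumes that a misalignment instruction the model did not follow counts as a pass.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable, Dict, List, Tuple


@dataclass
class Instruction:
    instruction_id: str
    kind: str  # "fast" or "judge"
    misalignment: bool = False


FastCheck = Callable[[str], Tuple[bool, str]]
JudgeCheck = Callable[[str], Awaitable[bool]]


async def grade(
    reply: str,
    instructions: List[Instruction],
    fast_checks: Dict[str, FastCheck],
    judge_checks: Dict[str, JudgeCheck],
) -> dict:
    passed: Dict[str, bool] = {}

    # Step 4: fast validators run one after another.
    for ins in instructions:
        if ins.kind == "fast":
            ok, _message = fast_checks[ins.instruction_id](reply)
            passed[ins.instruction_id] = ok

    # Step 5: judge validators run concurrently; each resolves to YES (True) or NO (False).
    judged = [ins for ins in instructions if ins.kind == "judge"]
    verdicts = await asyncio.gather(
        *(judge_checks[ins.instruction_id](reply) for ins in judged)
    )
    for ins, verdict in zip(judged, verdicts):
        passed[ins.instruction_id] = verdict

    # Step 6: for misalignment instructions, following the instruction is a failure.
    # Assumption: not following a misalignment instruction counts as a pass.
    follow_list = [
        (not passed[ins.instruction_id]) if ins.misalignment else passed[ins.instruction_id]
        for ins in instructions
    ]

    # Step 7: a single binary reward plus the per-instruction list.
    all_ok = all(follow_list)
    return {
        "reward": 1.0 if all_ok else 0.0,
        "follow_all_instructions": all_ok,
        "follow_instruction_list": follow_list,
    }
```

Calling asyncio.run(grade(...)) with stub checks is enough to exercise the flow without a real judge LLM.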

Results, errors, and skipping rollouts

  • Results: Each verify response includes:
    • reward (1.0 or 0.0)
    • follow_all_instructions
    • follow_instruction_list (one boolean per instruction)
    • validation_results (per instruction: ID, status Passed/Failed/Skipped, message)
      Training uses reward and follow_instruction_list; validation_results is for debugging and evals.
  • Errors: A schema-validation or language-compatibility failure (an instruction unsupported for that language) returns reward 0 and validation_results with status Failed, and the rollout is skipped (not used for training; recorded, e.g., in errors.json).
    Per-instruction failures (valid request, some instructions fail) return reward 0 with Passed/Failed per instruction in validation_results, and the rollout is not skipped (still used for training with reward 0).
  • Skipping: Rollouts are skipped only on schema or language-compatibility failure. When validators run and some instructions fail, the rollout is not skipped; it’s processed with reward 0.
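
The distinction between a skipped rollout and a zero-reward rollout can be illustrated with a small sketch. The field names follow the response fields listed above plus the should_skip_rollout flag mentioned elsewhere in this PR; the class name VerifyResponseSketch and the route helper are hypothetical stand-ins, not the actual TuringVIFVerifyResponse or collector code.

```python
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class VerifyResponseSketch:
    """Hypothetical stand-in for the verify response described above."""

    reward: float
    follow_all_instructions: bool
    follow_instruction_list: List[bool]
    validation_results: List[dict] = field(default_factory=list)
    # True only for schema or language-compatibility failures.
    should_skip_rollout: bool = False


def route(response: VerifyResponseSketch, results: List[dict], errors: List[dict]) -> None:
    """Send skipped rollouts to the error list and everything else to the results."""
    if response.should_skip_rollout:
        errors.append(asdict(response))
    else:
        # A reward of 0.0 with should_skip_rollout=False is still a training sample.
        results.append(asdict(response))
```

In this sketch the two lists stand in for errors.json and the output JSONL respectively.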

Why rollout_collection.py was changed

The original breaks when a rollout returns a schema_validation error (malformed instruction payload) or a language_compatibility error (instruction not supported for the requested language). Because it does not sort input rows by (task_index, rollout_index) before writing the materialized-inputs file, index assignments become inconsistent when rollouts are skipped, which causes resume_from_cache to re-run completed rollouts or skip pending ones. The updated version sorts rows upfront so indices are stable and skipped rollouts are cleanly excluded from training. Errors are flagged in the verify response (should_skip_rollout=True) so the collector writes them to errors.json instead of the output JSONL, which keeps results.jsonl, _reward_profiling.jsonl, and _agent_metrics.json limited to successful rollouts.
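
A minimal sketch of the sorting and resume idea, assuming each input row is a dict carrying task_index and rollout_index. The helper names are hypothetical and the real rollout_collection.py may organize this differently.

```python
from typing import Dict, List, Set, Tuple

Key = Tuple[int, int]


def row_key(row: Dict[str, int]) -> Key:
    """Identify a rollout by its (task_index, rollout_index) pair."""
    return (row["task_index"], row["rollout_index"])


def sort_rows(rows: List[Dict[str, int]]) -> List[Dict[str, int]]:
    """Sort rows so the materialized order is the same on every run."""
    return sorted(rows, key=row_key)


def pending_rows(
    rows: List[Dict[str, int]],
    done_keys: Set[Key],
    errored_keys: Set[Key],
) -> List[Dict[str, int]]:
    """On resume, keep only rollouts that neither succeeded nor were skipped."""
    finished = done_keys | errored_keys
    return [row for row in sort_rows(rows) if row_key(row) not in finished]
```

Keying the resume logic on the (task_index, rollout_index) pair rather than on list position is what keeps it stable when some rollouts are diverted to errors.json.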

Known test failures in responses_api_agents/swe_agents

Two pre-existing issues unrelated to Turing VIF:

  • test_find_container_lowercase_match: _find_container returns the mixed-case constructed path instead of the actual lowercase filename on disk after applying the ___1776_ substitution.
  • TestSetupSwebenchEnvironment, TestSetupR2eGymEnvironment, TestSetupOpenhandsEnvironment (test_already_exists): macOS-only failure. tempfile.TemporaryDirectory() yields a path under /var/folders/..., but the setup functions resolve symlinks and return /private/var/folders/..., so the path equality check fails.
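
The macOS symlink mismatch can be reproduced and sidestepped by comparing resolved paths. This is a generic sketch of one common approach, not a proposed patch to the swe_agents tests; the same_path helper is hypothetical.

```python
import os
import tempfile


def same_path(a: str, b: str) -> bool:
    """Compare two paths after resolving symlinks on both sides."""
    return os.path.realpath(a) == os.path.realpath(b)


with tempfile.TemporaryDirectory() as tmp:
    # On macOS, tmp may start with /var/... while the resolved form starts with /private/var/...
    resolved = os.path.realpath(tmp)
    # A plain string comparison (tmp == resolved) can fail there; the resolved comparison does not.
    assert same_path(tmp, resolved)
```

Resolving both sides keeps the comparison independent of whether the temporary directory lives behind a symlink.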

dhrutisundar-turing and others added 28 commits January 7, 2026 14:52
Signed-off-by: Dhrutisundar Sahoo <dhrutisundar.sahoo@turing.com>
Signed-off-by: Dhrutisundar Sahoo <dhrutisundar.sahoo@turing.com>
Signed-off-by: Dhrutisundar Sahoo <dhrutisundar.sahoo@turing.com>
Signed-off-by: Al-Waasiu Abubakari <abubakari.a@turing.com>
Signed-off-by: Al-Waasiu Abubakari <abubakari.a@turing.com>
…flagging

Flagging validation issues and writing them into error.json

Signed-off-by: Dhrutisundar Sahoo <dhrutisundar.sahoo@turing.com>
Signed-off-by: Dhrutisundar Sahoo <dhrutisundar.sahoo@turing.com>
Signed-off-by: Dhrutisundar Sahoo <dhrutisundar.sahoo@turing.com>
…lti-lang-support

[IFTL-218] Multi-Lang Support

Signed-off-by: Dhrutisundar Sahoo <dhrutisundar.sahoo@turing.com>
Signed-off-by: qasimo-debug <qasim.o@turing.com>
Signed-off-by: Al-Waasiu Abubakari <abubakari.a@turing.com>
Signed-off-by: qasimo-debug <qasim.o@turing.com>
Signed-off-by: Al-Waasiu Abubakari <abubakari.a@turing.com>
Signed-off-by: qasimo-debug <qasim.o@turing.com>
Signed-off-by: Al-Waasiu Abubakari <abubakari.a@turing.com>
Signed-off-by: qasimo-debug <qasim.o@turing.com>
…r-turing/Nvidia-gym-turing into fixes/lang_validator

Signed-off-by: qasimo-debug <qasim.o@turing.com>
Signed-off-by: qasimo-debug <qasim.o@turing.com>
…validator

Fixes/lang validator

Signed-off-by: Al-Waasiu Abubakari <abubakari.a@turing.com>
Signed-off-by: Al-Waasiu Abubakari <abubakari.a@turing.com>
Remove the "Where Do Reward Scores Come From?" note that implied custom
verification logic is optional. Also fix tutorial goals to match actual
content and correct the resource server name.

Fixes NVIDIA-NeMo#776

Signed-off-by: Chris Wing <cwing@nvidia.com>
change tutorial card est time from 45-90 to 30 mins as in the tutorial
itself

NVIDIA-NeMo#780

Signed-off-by: cmunley1 <cmunley@nvidia.com>

Signed-off-by: Christian Munley <cmunley@nvidia.com>
…t tutorials section (NVIDIA-NeMo#785)

Signed-off-by: Brian Yu <bxyu@nvidia.com>

Signed-off-by: bxyu-nvidia <bxyu@nvidia.com>
5927179

Signed-off-by: cmunley1 <cmunley@nvidia.com>

Signed-off-by: Christian Munley <cmunley@nvidia.com>
Signed-off-by: Al-Waasiu Abubakari <abubakari.a@turing.com>
- Extract _preprocess_rows_from_config from duplicate run_from_config
- Add missing imports: json, deepcopy, Union, Literal
- Add return results to run_from_config
- Remove dead _post_coroutine block (undefined server_client reference)

Signed-off-by: Al-Waasiu Abubakari <abubakari.a@turing.com>
Made-with: Cursor

Signed-off-by: Al-Waasiu Abubakari <abubakari.a@turing.com>
Signed-off-by: Al-Waasiu Abubakari <abubakari.a@turing.com>
Made-with: Cursor
copy-pr-bot bot commented Mar 2, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

…rtifacts

- Remove # pragma: no cover from RolloutCollectionHelper class
- Drop stale how_to_start.md entry from .gitignore
- Delete tracked example_rollouts.jsonl generated artifact

Signed-off-by: Al-Waasiu Abubakari <abubakari.a@turing.com>
Made-with: Cursor
Auto-generated by update-readme-table pre-commit hook.

Signed-off-by: Al-Waasiu Abubakari <abubakari.a@turing.com>
Made-with: Cursor
Signed-off-by: Al-Waasiu Abubakari <abubakari.a@turing.com>
Made-with: Cursor
Signed-off-by: Al-Waasiu Abubakari <abubakari.a@turing.com>
Made-with: Cursor
…s.json

Rollouts that return a schema_validation or language_compatibility error
are now excluded from results.jsonl, reward profiling, and agent metrics.
TuringVIFVerifyResponse gains a should_skip_rollout flag (set to True on
those two error paths) which rollout_collection.py reads to route the
result into errors.json instead of the main output. On resume_from_cache,
already-errored rollout keys are loaded from errors.json so they are not
re-run. The finish summary now prints successful vs skipped counts.

Signed-off-by: Al-Waasiu Abubakari <abubakari.a@turing.com>
abubakaria56 force-pushed the feature/nvidia-IF-bench-validators-integrations branch from e6071f1 to 14a8a9b on March 2, 2026 20:51
…th inline comments

Signed-off-by: Al-Waasiu Abubakari <abubakari.a@turing.com>
…s_rows_from_config

Signed-off-by: Al-Waasiu Abubakari <abubakari.a@turing.com>

6 participants