Feature/nvidia if bench validators integrations by abubakaria56 · Pull Request #801 · NVIDIA-NeMo/Gym

abubakaria56 · 2026-03-02T17:27:19Z

Turing VIF

A validation service in the NeMo Gym pipeline that answers: Did the model’s reply follow all the instructions? It does not generate text; it only grades one reply against a list of instructions and returns pass/fail and a reward (0 or 1) for training.

Two Kinds of Validators

Fast (rule-based): Counts and patterns (word count, start/end phrase, keywords, forbidden words, lists, tables, case, punctuation, etc.). Runs on the server; no external API.
Judge (LLM-based): Meaning and style (tone, formality, politeness, role, audience, custom yes/no questions). The server calls a separate “judge” LLM; those calls run in parallel.
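
As a rough illustration of the fast (rule-based) kind, the sketch below implements two such checks. The function names and message wording are hypothetical and not taken from this PR. Whitespace splitting is a simplification that would not suit languages such as Chinese or Japanese.

```python
import re


def check_max_words(reply: str, max_words: int) -> tuple[bool, str]:
    """Pass when the reply has at most max_words whitespace-separated words."""
    count = len(reply.split())
    if count <= max_words:
        return True, f"Word count {count} is within the limit of {max_words}."
    return False, f"Word count {count} exceeds the limit of {max_words}."


def check_forbidden_words(reply: str, forbidden: list[str]) -> tuple[bool, str]:
    """Pass when none of the forbidden words appears as a whole word."""
    found = [
        word
        for word in forbidden
        if re.search(rf"\b{re.escape(word)}\b", reply, re.IGNORECASE)
    ]
    if found:
        return False, f"Forbidden words present: {found}"
    return True, "No forbidden words found."
```

Each check returns a pass/fail flag and a short message, matching the Pass/Fail plus message shape described for fast validators.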

Multi-language support

Turing VIF supports the following languages: English (en), Spanish (es), French (fre), German (de), Italian (it), Brazilian Portuguese (pt-BR), Japanese (ja), Chinese (zh), Korean (ko), Hindi (hi), and Arabic (ar). Language is set per request.

Fast validators use language-specific rules (word boundaries, sentence splitting, vowels, case).
Case-related instructions are disabled where they don’t apply (e.g. “all caps” for Chinese/Japanese).
If any instruction is not supported for the requested language, the request fails early and the reply is not validated.
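
The early language-compatibility check could look roughly like the sketch below. The instruction IDs, the lookup sets, and the shape of the error dictionary are assumptions made for illustration. Only the `language_compatibility` error name and the Chinese/Japanese "all caps" example come from this description.

```python
from typing import Optional

# Assumption: languages without letter case, per the "all caps" example above.
CASELESS_LANGUAGES = {"zh", "ja"}
# Hypothetical instruction IDs for case-related checks.
CASE_INSTRUCTIONS = {"all_caps", "all_lowercase"}


def unsupported_instructions(language: str, instruction_ids: list[str]) -> list[str]:
    """Return the instruction IDs that do not apply to the given language."""
    if language not in CASELESS_LANGUAGES:
        return []
    return [i for i in instruction_ids if i in CASE_INSTRUCTIONS]


def precheck(language: str, instruction_ids: list[str]) -> Optional[dict]:
    """Return an error dict if any instruction is unsupported, else None."""
    bad = unsupported_instructions(language, instruction_ids)
    if not bad:
        return None
    return {
        "error_type": "language_compatibility",
        "message": f"Not supported for '{language}': {bad}",
    }
```

A non-None result would end the request before any validator runs.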

What Happens When It Runs

Request arrives with the model reply, list of instructions, optional custom judge questions, and language.
Basic checks: The request format and language support are verified. If the request is malformed, or the language or one of the instructions is unsupported for that language, the service returns a failure and runs no validators.
Instructions are split into fast vs judge.
Fast validators run sequentially; each returns Pass/Fail + short message.
Judge validators (and custom questions) run in parallel; judge returns YES/NO → Pass/Fail.
Misalignment: For instructions marked “misalignment,” if the model did follow them, the result is flipped to Fail (we want the model not to follow those).
Result: One reward (1.0 if all passed, 0.0 otherwise), a per-instruction pass/fail list, and validation messages. Training uses this reward and list.
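
The flow above can be sketched as follows. This is a minimal illustration, not the service's actual code: the `Instruction` dataclass, the `grade` function, and the callable-based check registries are hypothetical. It also assumes that a misalignment instruction the model did not follow counts as Pass, which the description implies but does not state.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable


@dataclass
class Instruction:
    instruction_id: str
    kind: str  # "fast" or "judge"
    misalignment: bool = False


async def grade(
    reply: str,
    instructions: list[Instruction],
    fast_checks: dict[str, Callable[[str], bool]],
    judge_checks: dict[str, Callable[[str], Awaitable[bool]]],
) -> dict:
    followed: dict[str, bool] = {}

    # Fast validators run one after another.
    for ins in instructions:
        if ins.kind == "fast":
            followed[ins.instruction_id] = fast_checks[ins.instruction_id](reply)

    # Judge validators are awaited together, so their calls overlap.
    judged = [ins for ins in instructions if ins.kind == "judge"]
    verdicts = await asyncio.gather(
        *(judge_checks[ins.instruction_id](reply) for ins in judged)
    )
    for ins, verdict in zip(judged, verdicts):
        followed[ins.instruction_id] = verdict

    # Misalignment instructions pass only when the model did NOT follow them.
    follow_list = [
        (not followed[ins.instruction_id]) if ins.misalignment
        else followed[ins.instruction_id]
        for ins in instructions
    ]
    return {
        "reward": 1.0 if all(follow_list) else 0.0,
        "follow_all_instructions": all(follow_list),
        "follow_instruction_list": follow_list,
    }
```

In a real deployment each judge check would wrap a call to the judge LLM and map its YES/NO answer to a boolean. Here any async callable returning `bool` stands in for it.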

Results, errors, and skipping rollouts

Results: Each verify response includes:
- reward (1.0 or 0.0)
- follow_all_instructions
- follow_instruction_list (one boolean per instruction)
- validation_results (per instruction: ID, status Passed/Failed/Skipped, message)
  Training uses reward and follow_instruction_list; validation_results is for debugging and evals.
Errors: A schema validation failure or a language-compatibility failure (an instruction unsupported for the requested language) returns reward 0 and validation_results with status Failed. The rollout is skipped: it is not used for training and is recorded separately (e.g. in errors.json).
Per-instruction failures: When the request is valid but some instructions fail, the response has reward 0 and validation_results with Passed/Failed per instruction. The rollout is not skipped and is still used for training with reward 0.
Skipping: Rollouts are skipped only on schema or language-compatibility failure. When validators run and some instructions fail, the rollout is not skipped; it’s processed with reward 0.
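
To make the distinction concrete, here is a minimal sketch of how the two outcomes might be assembled. The top-level field names follow the list above and the `should_skip_rollout` flag used by this PR. The function name, the keys inside each `validation_results` entry, and the empty `follow_instruction_list` in the error case are assumptions.

```python
from typing import Optional


def build_verify_response(
    statuses: list[str], request_error: Optional[str] = None
) -> dict:
    """Build a verify response from per-instruction statuses or a request error.

    statuses holds "Passed" or "Failed" per instruction. request_error is
    None, "schema_validation", or "language_compatibility".
    """
    if request_error is not None:
        # The request was rejected before validation, so the rollout is skipped.
        return {
            "reward": 0.0,
            "follow_all_instructions": False,
            "follow_instruction_list": [],
            "validation_results": [
                {"id": request_error, "status": "Failed",
                 "message": "request rejected before validation"}
            ],
            "should_skip_rollout": True,
        }

    # Validators ran: keep the rollout even if some instructions failed.
    follow = [status == "Passed" for status in statuses]
    return {
        "reward": 1.0 if all(follow) else 0.0,
        "follow_all_instructions": all(follow),
        "follow_instruction_list": follow,
        "validation_results": [
            {"id": f"instruction_{i}", "status": status, "message": ""}
            for i, status in enumerate(statuses)
        ],
        "should_skip_rollout": False,
    }
```

Both branches can yield reward 0. Only the flag tells the collector whether the rollout belongs in training data.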

Why `rollout_collection.py` was changed

The original breaks when a rollout returns a schema_validation error (malformed instruction payload) or a language_compatibility error (instruction not supported for the requested language). Because it does not sort input rows by (task_index, rollout_index) before writing the materialized-inputs file, index assignments become inconsistent when rollouts are skipped. As a result, resume_from_cache can re-run completed rollouts or skip pending ones.

The updated version sorts rows upfront so that indices are stable and skipped rollouts are cleanly excluded from training. Errors are flagged in the verify response (should_skip_rollout=True), and the collector writes them to errors.json instead of the output JSONL. This keeps results.jsonl, _reward_profiling.jsonl, and _agent_metrics.json limited to successful rollouts.
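
The two ideas can be sketched in a few lines. This is an illustration under assumptions, not the actual diff to `rollout_collection.py`: the helper names are hypothetical, and rows and responses are modelled as plain dictionaries carrying the `task_index`, `rollout_index`, and `should_skip_rollout` fields named above.

```python
def sort_rows(rows: list[dict]) -> list[dict]:
    """Order input rows by (task_index, rollout_index) so indices are stable."""
    return sorted(rows, key=lambda row: (row["task_index"], row["rollout_index"]))


def split_results(responses: list[dict]) -> tuple[list[dict], list[dict]]:
    """Separate usable rollouts from those flagged to be skipped."""
    kept = [r for r in responses if not r.get("should_skip_rollout", False)]
    errors = [r for r in responses if r.get("should_skip_rollout", False)]
    return kept, errors
```

In this sketch, `kept` would go to the output JSONL and `errors` to errors.json. Since the ordering is deterministic, a resumed run would see the same row positions as the original run.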