fix: pipecleaning 0.1.1 environments for rl#748
Closed
fsiino-nvidia wants to merge 27 commits into main from
Signed-off-by: Frankie Siino <fsiino@nvidia.com>
Signed-off-by: Frankie Siino <fsiino@cw-dfw-cs-001-login-01.cm.cluster> Signed-off-by: Frankie Siino <fsiino@nvidia.com>
…llection
…r handling, simplified judge prompt
…racted answer for judge, add truncation and warmup support
…er normalization, numeric fallback, max_steps limit
Fixes the issue described in #736: collate_samples writes metrics without the dataset config metadata (e.g. name, type, jsonl_fpath), while validate_samples_and_aggregate_metrics wrote those keys with the metadata, causing a conflict. This mirrors what the validate step does by merging the metadata before the conflict check. Signed-off-by: cmunley1 <cmunley@nvidia.com>
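A minimal sketch of the key mismatch described above, assuming illustrative record shapes (the dict names and the `with_metadata` helper are hypothetical, not the actual nemo-gym code):

```python
# Dataset config metadata that the validate step merges into its metrics rows
# (illustrative values, not real nemo-gym data).
dataset_meta = {"name": "gsm8k", "type": "math", "jsonl_fpath": "data/gsm8k.jsonl"}

def with_metadata(metrics: dict, meta: dict) -> dict:
    """Merge dataset config metadata into a metrics record so that the
    collate step and the validate step emit identically-keyed rows."""
    return {**meta, **metrics}

collated = {"accuracy": 0.5}                                 # collate_samples: bare metrics
validated = with_metadata({"accuracy": 0.5}, dataset_meta)   # validate step: metrics + metadata

# Mirroring the validate step's merge in the collate step removes the
# key mismatch, so the downstream conflict check no longer fires.
assert with_metadata(collated, dataset_meta) == validated
```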
Signed-off-by: Christian Munley <cmunley@nvidia.com>
Signed-off-by: Terry Kong <terryk@nvidia.com>
# Contributing To NeMo-Gym (NewtonBench Resource Server)

## 1) Basic information

### i. Description of the environment

A resource server wrapping the [NewtonBench](https://github.com/HKUST-KnowComp/NewtonBench) benchmark.

- **Tasks:** 324 scientific law discovery tasks across 12 physics domains.
- **Observation Space:** Experimental results (numeric or structured dictionaries) returned after tool use.
- **Tools:**
  - `run_experiment`: Query the environment with specific parameters to receive physical observations.
  - `execute_python`: (Optional) Python code-assisted discovery for complex data analysis.
- **Server:** FastAPI resource server following NeMo Gym conventions.

### ii. Description of the verification logic

The verifier uses the NewtonBench evaluation suite to score the agent's proposed scientific law:

- **Law Extraction:** Attempts to find a law within `<final_law>` tags in the assistant's final response.
- **Success Criteria:** Evaluates both **symbolic equivalence** (via an LLM judge) and **numeric accuracy** (Root Mean Square Logarithmic Error, RMSLE).
- **Reward Calculation:**
  - `reward = 0.3 * R_symbolic + 0.7 * R_numeric`
  - $R_{symbolic}$ is 1.0 if equivalent, -1.0 otherwise.
  - $R_{numeric} = 1.0 - (2.0 \cdot \text{RMSLE} / (\text{RMSLE} + 3.0))$, yielding a score in $(-1, 1]$.
  - The `/verify` endpoint processes the agent's submission and returns these detailed performance metrics.

### iii. Description of the prompts/tasks (source + domain)

**Domain:** Maths (Scientific Law Discovery).

**Source:** Tasks and prompts adapted from the [NewtonBench](https://github.com/HKUST-KnowComp/NewtonBench) benchmark, which instruct the agent to discover a specific shifted scientific law (e.g., Newton's Law of Gravitation, Snell's Law) by performing interactive experiments.

### iv. License information

- **Code:** Apache 2.0.
- **Data:** Apache 2.0.
- **NewtonBench Benchmark:** MIT (Copyright (c) 2025 HKUST-KnowComp).

## 2) Environment validity check

### i. Commands used to collect rollouts

```bash
# Start NeMo Gym servers (agent + NewtonBench)
config_paths="resources_servers/newton_bench/configs/newton_bench.yaml,\
responses_api_models/vllm_model/configs/vllm_model.yaml"
ng_run "+config_paths=[$config_paths]"

# Collect sample rollouts
ng_collect_rollouts \
  +agent_name=newton_bench_simple_agent \
  +input_jsonl_fpath=resources_servers/newton_bench/data/example.jsonl \
  +output_jsonl_fpath=resources_servers/newton_bench/data/example_rollouts.jsonl \
  +limit=5

# View rollouts
ng_viewer +jsonl_fpath=resources_servers/newton_bench/data/example_rollouts.jsonl
```

### ii. Resulting rollouts (5 examples)

See `resources_servers/newton_bench/data/example_rollouts.jsonl`.

**Expected behavior:**

- Agent performs several experiments, analyzes data, and submits a scientific law.
- Successful discovery $\rightarrow$ positive reward ($\approx$ 1.0).
- Failed discovery $\rightarrow$ reward $\approx$ 0.0 or negative.

## 3) Tests

### i. Commands used to run the tests

```bash
source resources_servers/newton_bench/.venv/bin/activate
pytest resources_servers/newton_bench/tests/test_app.py
```

**Coverage notes:** Resource server tests provide comprehensive coverage of the following areas:

- **Session Lifecycle:** Successful seeding, error handling for invalid modules, session ending, and background cleanup.
- **Experiment Execution:** Dynamic handler registration for each module, basic `run_experiment` execution, and error handling for uninitialized sessions, mismatched module calls, etc.
- **Python Sandbox:** Basic execution, session-based code persistence, timeout enforcement, and security validation (restricting dangerous imports/operations).
- **Verification Logic:** Law extraction from diverse response structures, and reward calculation via symbolic equivalence (LLM judge) and numeric RMSLE.

## 4) Reward profiling

**Models:** Qwen/Qwen3-VL-8B-Thinking

**Method:**

- 108 prompts based on version v0 of scientific laws.
- 4 rollouts per prompt (432 total).
- Tool calling of `run_experiment` enabled; the agent loops until law submission.

**Results:**

**Overall Metrics**

- **Total Rollouts:** 432
- **Mean Reward:** $\approx$ 0.0675
- **Median Reward:** 0.0
- **Min Reward:** $\approx$ -0.8786
- **Max Reward:** 1.0

**Tool Call Statistics**

- **Average Tool Calls:** 22.95 per rollout
- **Min Tool Calls:** 0
- **Max Tool Calls:** 1770
- **Correlation (tool calls $\leftrightarrow$ reward):** $\approx$ -0.0211 (weak negative correlation)

**Reward Distribution (Buckets)**

| Reward Range | Count |
|:---|:---|
| [-1.0, -0.8) | 16 |
| [-0.8, -0.6) | 16 |
| [-0.6, -0.4) | 60 |
| [-0.4, -0.2) | 39 |
| [-0.2, 0.0) | 24 |
| [0.0, 0.2) | 150 |
| [0.2, 0.4) | 46 |
| [0.4, 0.6) | 2 |
| [0.6, 0.8) | 1 |
| [0.8, 1.0] | 78 |

**Performance by Tool Call Count Bins**

| Tool Call Range | Rollouts (n) | Mean Reward |
|:---:|:---:|:---:|
| 0 | 23 | $\approx$ -0.1112 |
| 1–10 | 329 | $\approx$ 0.0824 |
| 11–50 | 60 | $\approx$ 0.1308 |
| 51–200 | 15 | $\approx$ -0.1959 |
| 201–2000 | 5 | $\approx$ -0.0600 |

**Key observations:**

- **Symbolic Accuracy:** Approximately 19.7% symbolic accuracy and a widespread RMSLE distribution indicate frequent failures to recover exact symbolic forms or precise numeric behavior.
- **Reward Distribution:** Rewards cluster near zero (median 0.0, mean $\approx$ 0.0675) with a long tail and many negative outcomes, reflecting frequent partial or failed discoveries.
- **Tool Usage Sweet Spot:** Positive performance is observed with moderate tool use (1–50 calls), peaking in the 11–50 range, suggesting that tool-driven data collection is critical for inducing scientific laws.
- **Diminishing Returns:** Performance declines sharply beyond 50 calls; additional tool calls become detrimental, and successful discovery depends on reasoning and hypothesis selection rather than raw data volume.
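The reward blend from the verification logic above can be sketched in a few lines. This is a minimal sketch: the formula and weights come from the README, while the function names are illustrative:

```python
def numeric_reward(rmsle: float) -> float:
    """Map RMSLE in [0, inf) onto (-1, 1]: zero error gives 1.0,
    and the score approaches -1.0 as the error grows."""
    return 1.0 - (2.0 * rmsle / (rmsle + 3.0))

def combined_reward(symbolically_equivalent: bool, rmsle: float) -> float:
    """Weighted blend: 30% symbolic judge verdict, 70% numeric accuracy."""
    r_symbolic = 1.0 if symbolically_equivalent else -1.0
    return 0.3 * r_symbolic + 0.7 * numeric_reward(rmsle)

print(combined_reward(True, 0.0))  # → 1.0 (perfect discovery)
print(numeric_reward(3.0))         # → 0.0 (RMSLE of 3 is the numeric break-even point)
```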
Signed-off-by: Kelvin0110 <kwtamai@connect.ust.hk>
Signed-off-by: cmunley1 <cmunley@nvidia.com>
Co-authored-by: Christian Munley <cmunley@nvidia.com>
Signed-off-by: Brian Yu <bxyu@nvidia.com>
6ca6a23 to 0bf68d9
See #768
This PR contains infra + environment-specific changes made during pipecleaning efforts to get the respective environments to run successfully in nemo-rl.

**rollout_collection:** Any server crash, network timeout, model error, or resource exhaustion can cause a rollout to fail, and any single failed row kills the entire batch. We now return a zero-reward fallback after waiting for `ROLLOUT_ROW_TIMEOUT_SECONDS`.

**equivalence_llm_judge:** Previously, an overly long generated answer from a thinking model would overflow the judge model's context window, crash the judge call, and lead to cascading rollout failures. The implementation allows opt-in truncation of the generated answer, via `max_judge_input_tokens` and `chars_per_token_estimate`, before passing it to the judge model.

**math_with_code:** Adds a fallback for `\boxed{}` answer extraction. Previously only assistant text messages were searched for the answer, so a model that answered via code execution received a zero reward. The fallback also searches `function_call_output`, so an answer can be extracted from the tool output if the model prints it inside the executed code.

**math_with_judge:**
- ~~Add config fields for judge truncation, similar to `equivalence_llm_judge`.~~
- ~~Similar to math_with_code.~~
- Send the extracted answer parsed from `\boxed{}` to the judge model instead of the raw generated answer, to avoid overwhelming the judge's context window.
- ~~Expected answers wrapped in `\(...\)` produce `\boxed{\(42\)}`, which is not parsable by the math_verify library. This strips the outer delimiters to fix parsing.~~
- Prevent a crash if the extracted answer is empty.

**mcqa:** Address null option values in the dataset. Previously these were skipped, which led to invalid rows being collected and training crashes.
**mini_swe_agent:** Adds a standalone data processing script to convert raw SWE-Bench/SWE-Gym data into the nemo-gym training format.