
fix: pipecleaning 0.1.1 environments for rl #748

Closed
fsiino-nvidia wants to merge 27 commits into main from fsiino/pipecleaning-rl-envs

Conversation

@fsiino-nvidia
Contributor

@fsiino-nvidia fsiino-nvidia commented Feb 23, 2026

See #768

This PR contains infra + environment-specific changes made during pipecleaning efforts to get the respective environments to run successfully in nemo-rl.

rollout_collection:
Any server crash, network timeout, model error, or resource exhaustion can cause a rollout to fail, and previously a single failed row killed the entire batch. We now return a zero-reward fallback after waiting ROLLOUT_ROW_TIMEOUT_SECONDS.
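A minimal sketch of the fallback behavior (the constant name ROLLOUT_ROW_TIMEOUT_SECONDS comes from this description; the function names and result schema are illustrative, not the actual nemo-gym implementation):

```python
import asyncio

# Default per-row budget; the real value is configured elsewhere.
ROLLOUT_ROW_TIMEOUT_SECONDS = 600.0

def zero_reward_fallback(row_id: str) -> dict:
    # Placeholder result so one bad row cannot kill the whole batch.
    return {"row_id": row_id, "reward": 0.0, "error": "rollout failed or timed out"}

async def collect_row(row_id: str, rollout_coro,
                      timeout: float = ROLLOUT_ROW_TIMEOUT_SECONDS) -> dict:
    try:
        # Bound each row instead of letting one hung rollout stall the batch.
        return await asyncio.wait_for(rollout_coro, timeout=timeout)
    except Exception:
        # Covers timeouts, server crashes surfacing as client errors, etc.
        return zero_reward_fallback(row_id)
```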

equivalence_llm_judge:
Previously, an overly long generated answer from a thinking model could overflow the judge model's context window, crash the judge call, and cascade into rollout failures. The implementation now allows opt-in truncation via max_judge_input_tokens and chars_per_token_estimate, shortening the generated answer before it is passed to the judge model.
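A rough sketch of the opt-in truncation (the two config field names come from this description; the chars-per-token heuristic and the keep-the-head policy are assumptions):

```python
from typing import Optional

def truncate_for_judge(answer: str,
                       max_judge_input_tokens: Optional[int] = None,
                       chars_per_token_estimate: float = 4.0) -> str:
    """Opt-in truncation: approximate a token budget via a chars-per-token ratio."""
    if max_judge_input_tokens is None:
        return answer  # feature disabled unless explicitly configured
    max_chars = int(max_judge_input_tokens * chars_per_token_estimate)
    if len(answer) <= max_chars:
        return answer
    # Keeping the head is an assumption; a real implementation might keep
    # the tail, where the final claim usually lives.
    return answer[:max_chars]
```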

math_with_code:
Adds a fallback for \boxed{} answer extraction. Previously only assistant text messages were searched for the answer, so a model that answered via code execution received a zero reward. The fallback also searches function_call_output, so an answer can be extracted from the tool output when the model prints it inside the executed code.
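The fallback search order can be sketched as follows (the message schema shown here is an assumption, not the exact nemo-gym rollout format):

```python
import re
from typing import Optional

# Simple pattern; does not handle nested braces inside \boxed{...}.
BOXED_RE = re.compile(r"\\boxed\{([^{}]*)\}")

def extract_boxed_answer(messages: list) -> Optional[str]:
    """Search assistant text first, then fall back to tool outputs."""
    def find(text: str) -> Optional[str]:
        m = BOXED_RE.search(text or "")
        return m.group(1).strip() if m else None

    for msg in messages:
        if msg.get("role") == "assistant":
            ans = find(msg.get("content", ""))
            if ans is not None:
                return ans
    # Fallback: the model may have printed the answer inside executed code.
    for msg in messages:
        if msg.get("type") == "function_call_output":
            ans = find(msg.get("output", ""))
            if ans is not None:
                return ans
    return None
```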

math_with_judge:
- Add config fields for judge truncation, similar to equivalence_llm_judge.
- Similar to math_with_code:
  - Send the extracted answer parsed from \boxed{} to the judge model instead of the raw generated answer, to avoid overwhelming the judge's context window.
  - Expected answers wrapped in \(...\) produce \boxed{\(42\)}, which is not parsable by the math_verify library. This strips the outer delimiters to fix parsing.
  - Prevent a crash if the extracted answer is empty.
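The delimiter stripping and empty-answer guard might look like this (a hypothetical helper, not the actual implementation):

```python
from typing import Optional

def normalize_expected_answer(ans: str) -> Optional[str]:
    """Strip outer \\( ... \\) delimiters so math_verify can parse the answer,
    and return None for empty answers instead of crashing downstream."""
    ans = (ans or "").strip()
    if ans.startswith(r"\(") and ans.endswith(r"\)"):
        ans = ans[2:-2].strip()
    return ans or None
```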

mcqa:
Address null option values in the dataset. Previously these were skipped, which led to invalid rows being collected and training crashes.
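A minimal sketch of rejecting rows with null options instead of silently skipping the null values (the helper name and row handling are illustrative):

```python
from typing import Dict, Optional

def clean_mcqa_options(options: Dict[str, Optional[str]]) -> Optional[Dict[str, str]]:
    """Return None for rows containing null options so the caller can drop
    the whole row, rather than skipping nulls and emitting an invalid row."""
    if any(v is None for v in options.values()):
        return None
    return options
```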

mini_swe_agent:
Add a standalone data processing script to convert raw SWE-Bench/SWE-Gym data into the nemo-gym training format.

@copy-pr-bot

copy-pr-bot bot commented Feb 23, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@fsiino-nvidia fsiino-nvidia linked an issue Feb 23, 2026 that may be closed by this pull request
2 tasks
@fsiino-nvidia fsiino-nvidia changed the title Pipecleaning changes for 0.1.1 environments fix: pipecleaning 0.1.1 environments for rl Feb 25, 2026
@fsiino-nvidia fsiino-nvidia marked this pull request as ready for review February 26, 2026 01:42
Frankie Siino and others added 25 commits February 25, 2026 18:24
Signed-off-by: Frankie Siino <fsiino@nvidia.com>
Signed-off-by: Frankie Siino <fsiino@nvidia.com>
Signed-off-by: Frankie Siino <fsiino@nvidia.com>
Signed-off-by: Frankie Siino <fsiino@nvidia.com>
Signed-off-by: Frankie Siino <fsiino@cw-dfw-cs-001-login-01.cm.cluster>
Signed-off-by: Frankie Siino <fsiino@nvidia.com>
Signed-off-by: Frankie Siino <fsiino@cw-dfw-cs-001-login-01.cm.cluster>
Signed-off-by: Frankie Siino <fsiino@nvidia.com>
…llection

Signed-off-by: Frankie Siino <fsiino@cw-dfw-cs-001-login-01.cm.cluster>
Signed-off-by: Frankie Siino <fsiino@nvidia.com>
Signed-off-by: Frankie Siino <fsiino@cw-dfw-cs-001-login-01.cm.cluster>
Signed-off-by: Frankie Siino <fsiino@nvidia.com>
Signed-off-by: Frankie Siino <fsiino@cw-dfw-cs-001-login-01.cm.cluster>
Signed-off-by: Frankie Siino <fsiino@nvidia.com>
Signed-off-by: Frankie Siino <fsiino@cw-dfw-cs-001-login-01.cm.cluster>
Signed-off-by: Frankie Siino <fsiino@nvidia.com>
…r handling, simplified judge prompt

Signed-off-by: Frankie Siino <fsiino@cw-dfw-cs-001-login-01.cm.cluster>
Signed-off-by: Frankie Siino <fsiino@nvidia.com>
…racted answer for judge, add truncation and warmup support

Signed-off-by: Frankie Siino <fsiino@cw-dfw-cs-001-login-01.cm.cluster>
Signed-off-by: Frankie Siino <fsiino@nvidia.com>
…er normalization, numeric fallback, max_steps limit

Signed-off-by: Frankie Siino <fsiino@cw-dfw-cs-001-login-01.cm.cluster>
Signed-off-by: Frankie Siino <fsiino@nvidia.com>
Signed-off-by: Frankie Siino <fsiino@cw-dfw-cs-001-login-01.cm.cluster>
Signed-off-by: Frankie Siino <fsiino@nvidia.com>
Signed-off-by: Frankie Siino <fsiino@cw-dfw-cs-001-login-01.cm.cluster>
Signed-off-by: Frankie Siino <fsiino@nvidia.com>
Signed-off-by: Frankie Siino <fsiino@cw-dfw-cs-001-login-01.cm.cluster>
Signed-off-by: Frankie Siino <fsiino@nvidia.com>
Signed-off-by: Frankie Siino <fsiino@cw-dfw-cs-001-login-01.cm.cluster>
Signed-off-by: Frankie Siino <fsiino@nvidia.com>
Signed-off-by: Frankie Siino <fsiino@nvidia.com>
Signed-off-by: Frankie Siino <fsiino@nvidia.com>
Signed-off-by: Frankie Siino <fsiino@nvidia.com>
Signed-off-by: Frankie Siino <fsiino@nvidia.com>
issue described in #736

collate_samples writes metrics without the dataset config metadata (e.g.
name, type, jsonl_fpath), while validate_samples_and_aggregate_metrics
wrote them with metadata, causing a conflict. This mirrors what the
validate step does to merge the metadata before the conflict check.

Signed-off-by: cmunley1 <cmunley@nvidia.com>
Signed-off-by: Christian Munley <cmunley@nvidia.com>
Signed-off-by: Terry Kong <terryk@nvidia.com>
# Contributing To NeMo-Gym (NewtonBench Resource Server)

## 1) Basic information

### i. Description of the environment
A resource server wrapping the
[NewtonBench](https://github.com/HKUST-KnowComp/NewtonBench) benchmark.
- **Tasks:** 324 scientific law discovery tasks across 12 physics
domains.
- **Observation Space:** Experimental results (numeric or structured
dictionaries) returned after tool use.
- **Tools:**
- `run_experiment`: Query the environment with specific parameters to
receive physical observations.
- `execute_python`: (Optional) Python code-assisted discovery for
complex data analysis.
- **Server:** FastAPI resource server following NeMo Gym conventions.

### ii. Description of the verification logic
The verifier uses the NewtonBench evaluation suite to score the agent's
proposed scientific law:
- **Law Extraction:** Attempts to find a law within `<final_law>` tags
in the assistant's final response.
- **Success Criteria:** Evaluates both **symbolic equivalence** (via an
LLM judge) and **numeric accuracy** (Root Mean Square Logarithmic Error
- RMSLE).
- **Reward Calculation:**
  - `reward = 0.3 * R_symbolic + 0.7 * R_numeric`.
    - $R_{symbolic}$ is 1.0 if equivalent, -1.0 otherwise.
- $R_{numeric} = 1.0 - (2.0 * \text{RMSLE} / (\text{RMSLE} + 3.0))$,
yielding a score in $(-1, 1]$.
- `/verify` endpoint processes the agent's submission and returns these
detailed performance metrics.
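The reward formula above can be written out directly (a straightforward transcription of the equations, not NewtonBench's actual code):

```python
def newtonbench_reward(symbolic_equivalent: bool, rmsle: float) -> float:
    """reward = 0.3 * R_symbolic + 0.7 * R_numeric, per the formulas above."""
    r_symbolic = 1.0 if symbolic_equivalent else -1.0
    # R_numeric maps RMSLE in [0, inf) onto (-1, 1]: zero error gives 1.0,
    # and the score approaches -1 as RMSLE grows.
    r_numeric = 1.0 - (2.0 * rmsle / (rmsle + 3.0))
    return 0.3 * r_symbolic + 0.7 * r_numeric
```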

### iii. Description of the prompts/tasks (source + domain)
**Domain:** Maths (Scientific Law Discovery).
**Source:** Tasks and prompts adapted from the
[NewtonBench](https://github.com/HKUST-KnowComp/NewtonBench) benchmark,
which instruct the agent to discover a specific shifted scientific law
(e.g., Newton's Law of Gravitation, Snell's Law) by performing
interactive experiments.

### iv. License information
- **Code:** Apache 2.0.
- **Data:** Apache 2.0.
- **NewtonBench Benchmark:** MIT (Copyright (c) 2025 HKUST-KnowComp).

## 2) Environment validity check

### i. Commands used to collect rollouts
```bash
# Start NeMo Gym servers (agent + NewtonBench)
config_paths="resources_servers/newton_bench/configs/newton_bench.yaml,\
responses_api_models/vllm_model/configs/vllm_model.yaml"
ng_run "+config_paths=[$config_paths]"

# Collect sample rollouts
ng_collect_rollouts \
    +agent_name=newton_bench_simple_agent \
    +input_jsonl_fpath=resources_servers/newton_bench/data/example.jsonl \
    +output_jsonl_fpath=resources_servers/newton_bench/data/example_rollouts.jsonl \
    +limit=5

# View rollouts
ng_viewer +jsonl_fpath=resources_servers/newton_bench/data/example_rollouts.jsonl
```

### ii. Resulting rollouts (5 examples)
See `resources_servers/newton_bench/data/example_rollouts.jsonl`
**Expected behavior:**
- Agent performs several experiments, analyzes data, and submits a
scientific law.
- Successful discovery $\rightarrow$ positive reward ($\approx$ 1.0).
- Failed discovery $\rightarrow$ reward $\approx$ 0.0 or negative.

## 3) Tests

### i. Commands used to run the tests
```bash
source resources_servers/newton_bench/.venv/bin/activate
pytest resources_servers/newton_bench/tests/test_app.py
```

**Coverage notes:**
Resource server tests provide comprehensive coverage of the following
areas:
- **Session Lifecycle:** Successful seeding, error handling for invalid
modules, session ending, and background cleanup.
- **Experiment Execution:** Dynamic handler registration for each
module, basic `run_experiment` execution, and error handling for
uninitialized sessions, mismatched module calls, etc.
- **Python Sandbox:** Basic execution, session-based code persistence,
timeout enforcement, and security validation (restricting dangerous
imports/operations).
- **Verification Logic:** Law extraction from diverse response
structures, and reward calculation via symbolic equivalence (LLM judge)
and numeric RMSLE.

## 4) Reward profiling

**Models:** Qwen/Qwen3-VL-8B-Thinking

**Method:**
- 108 prompts based on version v0 of scientific laws.
- 4 rollouts per prompt (432 total).
- Tool calling of `run_experiment` enabled and agent loops until law
submission.

**Results:**
**Overall Metrics**
- **Total Rollouts:** 432
- **Mean Reward:** $\approx$ 0.0675
- **Median Reward:** 0.0
- **Min Reward:** $\approx$ -0.8786
- **Max Reward:** 1.0

**Tool Call Statistics**
- **Average Tool Calls:** 22.95 per rollout
- **Min Tool Calls:** 0
- **Max Tool Calls:** 1770
- **Correlation (tool calls $\leftrightarrow$ reward):** $\approx$
-0.0211 (Weak negative correlation)

**Reward Distribution (Buckets)**
| Reward Range | Count |
|:---|:---|
| [-1.0, -0.8) | 16 |
| [-0.8, -0.6) | 16 |
| [-0.6, -0.4) | 60 |
| [-0.4, -0.2) | 39 |
| [-0.2, 0.0) | 24 |
| [0.0, 0.2) | 150 |
| [0.2, 0.4) | 46 |
| [0.4, 0.6) | 2 |
| [0.6, 0.8) | 1 |
| [0.8, 1.0] | 78 |

**Performance by Tool Call Count Bins**
| Tool Call Range | Rollouts (n) | Mean Reward |
|:---:|:---:|:---:|
| 0 | 23 | $\approx$ -0.1112 |
| 1–10 | 329 | $\approx$ 0.0824 |
| 11–50 | 60 | $\approx$ 0.1308 |
| 51–200 | 15 | $\approx$ -0.1959 |
| 201–2000 | 5 | $\approx$ -0.0600 |

**Key observations**:
- **Symbolic Accuracy:** Approximately 19.7% symbolic accuracy and a
wide RMSLE distribution indicate frequent failures to recover exact
symbolic forms or precise numeric behavior.
- **Reward Distribution:** Rewards cluster near zero (median 0.0, mean
~0.0675) with a long tail and many negative outcomes, reflecting
frequent partial or failed discoveries.
- **Tool Usage Sweet Spot:** Positive performance is observed with
moderate tool use (1–50 calls), with a peak in the 11–50 range,
suggesting that tool-driven data collection is critical for inducing
scientific laws.
- **Diminishing Returns:** Performance declines sharply beyond 50 calls,
suggesting that additional tool calls become detrimental and that
successful discovery depends on reasoning and hypothesis selection
rather than raw data volume.

---------

Signed-off-by: Kelvin0110 <kwtamai@connect.ust.hk>
Signed-off-by: cmunley1 <cmunley@nvidia.com>
Co-authored-by: Christian Munley <cmunley@nvidia.com>
Signed-off-by: Brian Yu <bxyu@nvidia.com>
Signed-off-by: Brian Yu <bxyu@nvidia.com>
@fsiino-nvidia fsiino-nvidia force-pushed the fsiino/pipecleaning-rl-envs branch from 6ca6a23 to 0bf68d9 Compare February 26, 2026 02:27
@fsiino-nvidia fsiino-nvidia requested a review from a team as a code owner February 26, 2026 02:27
@fsiino-nvidia fsiino-nvidia deleted the fsiino/pipecleaning-rl-envs branch February 26, 2026 02:39