fix: pipecleaning 0.1.1 environments for rl#748
Closed
fsiino-nvidia wants to merge 27 commits into main from
Signed-off-by: Frankie Siino <fsiino@nvidia.com>
Signed-off-by: Frankie Siino <fsiino@cw-dfw-cs-001-login-01.cm.cluster> Signed-off-by: Frankie Siino <fsiino@nvidia.com>
…llection
…r handling, simplified judge prompt
…racted answer for judge, add truncation and warmup support
…er normalization, numeric fallback, max_steps limit
Fixes the issue described in #736: collate_samples writes metrics without the dataset config metadata (e.g. name, type, jsonl_fpath), while validate_samples_and_aggregate_metrics wrote those keys with the metadata, causing a conflict. This mirrors what the validate step does by merging the metadata before the conflict check. Signed-off-by: cmunley1 <cmunley@nvidia.com>
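A minimal sketch of the key mismatch described above, assuming illustrative record shapes (the dict names and the `with_metadata` helper are hypothetical, not the actual nemo-gym code):

```python
# Dataset config metadata that the validate step merges into its metrics rows
# (illustrative values, not real nemo-gym data).
dataset_meta = {"name": "gsm8k", "type": "math", "jsonl_fpath": "data/gsm8k.jsonl"}

def with_metadata(metrics: dict, meta: dict) -> dict:
    """Merge dataset config metadata into a metrics record so that the
    collate step and the validate step emit identically-keyed rows."""
    return {**meta, **metrics}

collated = {"accuracy": 0.5}                                 # collate_samples: bare metrics
validated = with_metadata({"accuracy": 0.5}, dataset_meta)   # validate step: metrics + metadata

# Mirroring the validate step's merge in the collate step removes the
# key mismatch, so the downstream conflict check no longer fires.
assert with_metadata(collated, dataset_meta) == validated
```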
Signed-off-by: Christian Munley <cmunley@nvidia.com>
Signed-off-by: Terry Kong <terryk@nvidia.com>
# Contributing To NeMo-Gym (NewtonBench Resource Server)

## 1) Basic information

### i. Description of the environment

A resource server wrapping the [NewtonBench](https://github.com/HKUST-KnowComp/NewtonBench) benchmark.

- **Tasks:** 324 scientific law discovery tasks across 12 physics domains.
- **Observation Space:** Experimental results (numeric or structured dictionaries) returned after tool use.
- **Tools:**
  - `run_experiment`: Query the environment with specific parameters to receive physical observations.
  - `execute_python`: (Optional) Python code-assisted discovery for complex data analysis.
- **Server:** FastAPI resource server following NeMo Gym conventions.

### ii. Description of the verification logic

The verifier uses the NewtonBench evaluation suite to score the agent's proposed scientific law:

- **Law Extraction:** Attempts to find a law within `<final_law>` tags in the assistant's final response.
- **Success Criteria:** Evaluates both **symbolic equivalence** (via an LLM judge) and **numeric accuracy** (Root Mean Square Logarithmic Error, RMSLE).
- **Reward Calculation:**
  - `reward = 0.3 * R_symbolic + 0.7 * R_numeric`
  - $R_{symbolic}$ is 1.0 if equivalent, -1.0 otherwise.
  - $R_{numeric} = 1.0 - (2.0 \cdot \text{RMSLE} / (\text{RMSLE} + 3.0))$, yielding a score in $(-1, 1]$.
  - The `/verify` endpoint processes the agent's submission and returns these detailed performance metrics.

### iii. Description of the prompts/tasks (source + domain)

**Domain:** Maths (Scientific Law Discovery).

**Source:** Tasks and prompts adapted from the [NewtonBench](https://github.com/HKUST-KnowComp/NewtonBench) benchmark, which instruct the agent to discover a specific shifted scientific law (e.g., Newton's Law of Gravitation, Snell's Law) by performing interactive experiments.

### iv. License information

- **Code:** Apache 2.0.
- **Data:** Apache 2.0.
- **NewtonBench Benchmark:** MIT (Copyright (c) 2025 HKUST-KnowComp).

## 2) Environment validity check

### i. Commands used to collect rollouts

```bash
# Start NeMo Gym servers (agent + NewtonBench)
config_paths="resources_servers/newton_bench/configs/newton_bench.yaml,\
responses_api_models/vllm_model/configs/vllm_model.yaml"
ng_run "+config_paths=[$config_paths]"

# Collect sample rollouts
ng_collect_rollouts \
  +agent_name=newton_bench_simple_agent \
  +input_jsonl_fpath=resources_servers/newton_bench/data/example.jsonl \
  +output_jsonl_fpath=resources_servers/newton_bench/data/example_rollouts.jsonl \
  +limit=5

# View rollouts
ng_viewer +jsonl_fpath=resources_servers/newton_bench/data/example_rollouts.jsonl
```

### ii. Resulting rollouts (5 examples)

See `resources_servers/newton_bench/data/example_rollouts.jsonl`.

**Expected behavior:**

- Agent performs several experiments, analyzes data, and submits a scientific law.
- Successful discovery $\rightarrow$ positive reward ($\approx$ 1.0).
- Failed discovery $\rightarrow$ reward $\approx$ 0.0 or negative.

## 3) Tests

### i. Commands used to run the tests

```bash
source resources_servers/newton_bench/.venv/bin/activate
pytest resources_servers/newton_bench/tests/test_app.py
```

**Coverage notes:** Resource server tests provide comprehensive coverage of the following areas:

- **Session Lifecycle:** Successful seeding, error handling for invalid modules, session ending, and background cleanup.
- **Experiment Execution:** Dynamic handler registration for each module, basic `run_experiment` execution, and error handling for uninitialized sessions, mismatched module calls, etc.
- **Python Sandbox:** Basic execution, session-based code persistence, timeout enforcement, and security validation (restricting dangerous imports/operations).
- **Verification Logic:** Law extraction from diverse response structures, and reward calculation via symbolic equivalence (LLM judge) and numeric RMSLE.

## 4) Reward profiling

**Models:** Qwen/Qwen3-VL-8B-Thinking

**Method:**

- 108 prompts based on version v0 of scientific laws.
- 4 rollouts per prompt (432 total).
- Tool calling of `run_experiment` enabled; the agent loops until law submission.

**Results:**

**Overall Metrics**

- **Total Rollouts:** 432
- **Mean Reward:** $\approx$ 0.0675
- **Median Reward:** 0.0
- **Min Reward:** $\approx$ -0.8786
- **Max Reward:** 1.0

**Tool Call Statistics**

- **Average Tool Calls:** 22.95 per rollout
- **Min Tool Calls:** 0
- **Max Tool Calls:** 1770
- **Correlation (tool calls $\leftrightarrow$ reward):** $\approx$ -0.0211 (weak negative correlation)

**Reward Distribution (Buckets)**

| Reward Range | Count |
|:---|:---|
| [-1.0, -0.8) | 16 |
| [-0.8, -0.6) | 16 |
| [-0.6, -0.4) | 60 |
| [-0.4, -0.2) | 39 |
| [-0.2, 0.0) | 24 |
| [0.0, 0.2) | 150 |
| [0.2, 0.4) | 46 |
| [0.4, 0.6) | 2 |
| [0.6, 0.8) | 1 |
| [0.8, 1.0] | 78 |

**Performance by Tool Call Count Bins**

| Tool Call Range | Rollouts (n) | Mean Reward |
|:---:|:---:|:---:|
| 0 | 23 | $\approx$ -0.1112 |
| 1–10 | 329 | $\approx$ 0.0824 |
| 11–50 | 60 | $\approx$ 0.1308 |
| 51–200 | 15 | $\approx$ -0.1959 |
| 201–2000 | 5 | $\approx$ -0.0600 |

**Key observations:**

- **Symbolic Accuracy:** Approximately 19.7% symbolic accuracy and a widespread RMSLE distribution indicate frequent failures to recover exact symbolic forms or precise numeric behavior.
- **Reward Distribution:** Rewards cluster near zero (median 0.0, mean $\approx$ 0.0675) with a long tail and many negative outcomes, reflecting frequent partial or failed discoveries.
- **Tool Usage Sweet Spot:** Positive performance is observed with moderate tool use (1–50 calls), peaking in the 11–50 range, suggesting that tool-driven data collection is critical for inducing scientific laws.
- **Diminishing Returns:** Performance declines sharply beyond 50 calls; additional tool calls become detrimental, and successful discovery depends on reasoning and hypothesis selection rather than raw data volume.
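The reward blend from the verification logic above can be sketched in a few lines. This is a minimal sketch: the formula and weights come from the README, while the function names are illustrative:

```python
def numeric_reward(rmsle: float) -> float:
    """Map RMSLE in [0, inf) onto (-1, 1]: zero error gives 1.0,
    and the score approaches -1.0 as the error grows."""
    return 1.0 - (2.0 * rmsle / (rmsle + 3.0))

def combined_reward(symbolically_equivalent: bool, rmsle: float) -> float:
    """Weighted blend: 30% symbolic judge verdict, 70% numeric accuracy."""
    r_symbolic = 1.0 if symbolically_equivalent else -1.0
    return 0.3 * r_symbolic + 0.7 * numeric_reward(rmsle)

print(combined_reward(True, 0.0))  # → 1.0 (perfect discovery)
print(numeric_reward(3.0))         # → 0.0 (RMSLE of 3 is the numeric break-even point)
```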
Signed-off-by: Kelvin0110 <kwtamai@connect.ust.hk>
Signed-off-by: cmunley1 <cmunley@nvidia.com>
Co-authored-by: Christian Munley <cmunley@nvidia.com>
Signed-off-by: Brian Yu <bxyu@nvidia.com>
6ca6a23 to 0bf68d9
See #768
This PR contains infra + environment-specific changes made during pipecleaning efforts to get the respective environments to run successfully in nemo-rl.

**rollout_collection:** Any server crash, network timeout, model error, or resource exhaustion can cause a rollout to fail, and any single failed row kills the entire batch. We now return a zero-reward fallback after waiting for `ROLLOUT_ROW_TIMEOUT_SECONDS`.

**equivalence_llm_judge:** Previously, an overly long generated answer from a thinking model would overflow the judge model's context window, crash the judge call, and lead to cascading rollout failures. The implementation allows opt-in truncation of the generated answer, via `max_judge_input_tokens` and `chars_per_token_estimate`, before passing it to the judge model.

**math_with_code:** Adds a fallback for `\boxed{}` answer extraction. Previously only assistant text messages were searched for the answer, so a model that answered via code execution received a zero reward. The fallback also searches `function_call_output`, so an answer can be extracted from the tool output if the model prints it inside the executed code.

**math_with_judge:**
- ~~Add config fields for judge truncation, similar to `equivalence_llm_judge`.~~
- ~~Similar to math_with_code.~~
- Send the extracted answer parsed from `\boxed{}` to the judge model instead of the raw generated answer, to avoid overwhelming the judge's context window.
- ~~Expected answers wrapped in `\(...\)` produce `\boxed{\(42\)}`, which is not parsable by the math_verify library. This strips the outer delimiters to fix parsing.~~
- Prevent a crash if the extracted answer is empty.

**mcqa:** Address null option values in the dataset. Previously these were skipped, which led to invalid rows being collected and training crashes.
**mini_swe_agent:** Adds a standalone data processing script to convert raw SWE-Bench/SWE-Gym data into the nemo-gym training format.