pyproject.toml
Outdated
"docs: mark tests related to documentation (deselect with '-m \"not docs\"')",
"skipduringci: marks tests that are skipped during ci as they are addressed by Jenkins jobs but should be run to test user setups",
"pleasefixme: marks tests that are broken and need fixing",
"e2e: end-to-end tests requiring downloaded databases and Spider2 reference repo",
Contributor
Can these be pushed down into the spider2_lite requirements.txt instead?
Author
@jfarris-nvidia yes, these are also mainly for my own pipe cleaning as well. They can be removed from the final PR if deemed unnecessary.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Ryan Lempka <rlempka@nvidia.com>
Execution-based text-to-SQL evaluation on Spider 2.0-Lite (135 SQLite tasks). Binary reward via result-set column-vector matching against gold SQL or pre-computed gold result CSVs. No LLM judge required. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Ryan Lempka <rlempka@nvidia.com>
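The column-vector matching described in this commit can be sketched roughly as follows. This is a minimal illustration, not the server's actual code: the helper names (`run_query`, `vectors_match`, `binary_reward`), the numeric tolerance, and the order-insensitive comparison are all assumptions made for the example.

```python
import math
import sqlite3


def run_query(conn, sql):
    """Execute SQL and return the result as a list of column vectors."""
    rows = conn.execute(sql).fetchall()
    return [list(col) for col in zip(*rows)] if rows else []


def _values_equal(a, b, tol=1e-2):
    # Assumed tolerance for numeric cells; other values must match exactly.
    if a is None or b is None:
        return a is None and b is None
    if isinstance(a, (int, float)) and isinstance(b, (int, float)):
        return math.isclose(a, b, abs_tol=tol)
    return a == b


def _sort_key(value):
    # Sort on the string form so mixed-type columns stay sortable.
    return (value is None, str(value))


def vectors_match(gold_col, pred_col):
    """Compare two column vectors, ignoring row order (an assumption)."""
    if len(gold_col) != len(pred_col):
        return False
    gold_sorted = sorted(gold_col, key=_sort_key)
    pred_sorted = sorted(pred_col, key=_sort_key)
    return all(_values_equal(g, p) for g, p in zip(gold_sorted, pred_sorted))


def binary_reward(conn, gold_sql, pred_sql):
    """Return 1.0 if every gold column appears among the predicted columns."""
    try:
        pred_cols = run_query(conn, pred_sql)
    except sqlite3.Error:
        return 0.0  # predicted SQL that fails to execute earns no reward
    gold_cols = run_query(conn, gold_sql)
    if not gold_cols:
        return 1.0 if not pred_cols else 0.0
    matched = all(any(vectors_match(g, p) for p in pred_cols) for g in gold_cols)
    return 1.0 if matched else 0.0


# Hypothetical toy database, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, name TEXT, score REAL)")
conn.executemany("INSERT INTO t VALUES (?, ?, ?)", [(1, "a", 1.5), (2, "b", 2.5)])

gold = "SELECT name, score FROM t ORDER BY id"
reward_ok = binary_reward(conn, gold, "SELECT score, id, name FROM t ORDER BY id DESC")
reward_bad = binary_reward(conn, gold, "SELECT name FROM t")
reward_err = binary_reward(conn, gold, "SELEC name FROM t")
```

In this sketch, extra predicted columns and a different column order do not affect the reward, because each gold column only needs to match some predicted column.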
…n eval_utils
- compare_result_sets: filter condition_cols indices that exceed the number of gold SQL result columns (CSV-indexed condition_cols can exceed SQL col count)
- compare_multi_result_sets: remove n > 1 guard so flat int lists are treated as shared condition_cols for all gold sets regardless of gold set count
- execute_and_compare: route through compare_multi_result_sets to correctly normalize list-of-lists condition_cols for single-gold comparisons
- Add e2e oracle test suite (tests/test_e2e_oracle.py) and e2e pytest marker
All 24 gold SQL oracle tasks now pass (was 18/24).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Ryan Lempka <rlempka@nvidia.com>
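The two condition_cols fixes above can be illustrated with a small sketch. The function names here (`normalize_condition_cols`, `filter_condition_cols`) are hypothetical and are not the names used in eval_utils; the sketch only shows the behavior the commit message describes, under the assumption that condition_cols is either a flat list of ints or a list of lists.

```python
def normalize_condition_cols(condition_cols, num_gold_sets):
    """Return one list of column indices per gold result set.

    A flat list of ints is treated as shared by every gold set, regardless
    of how many gold sets there are. A list of lists is assumed to already
    hold one entry per gold set.
    """
    if not condition_cols:
        return [[] for _ in range(num_gold_sets)]
    if all(isinstance(c, int) for c in condition_cols):
        return [list(condition_cols) for _ in range(num_gold_sets)]
    return [list(cols) for cols in condition_cols]


def filter_condition_cols(condition_cols, num_gold_columns):
    """Drop indices that point past the last gold SQL result column."""
    return [i for i in condition_cols if 0 <= i < num_gold_columns]


shared = normalize_condition_cols([0, 2], num_gold_sets=3)
single = normalize_condition_cols([[0, 1]], num_gold_sets=1)
in_range = filter_condition_cols([0, 3, 5], num_gold_columns=4)
```

Filtering matters because indices written against a gold CSV can exceed the column count of the gold SQL result, as the commit message notes.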
Starts a local vLLM server (or uses SPIDER2_LLM_URL if pre-running), calls the model for each of the 5 example tasks, and verifies the extracted SQL through the Spider2-Lite resource server. Each task is a separate parametrized test that asserts the pipeline produces a valid verdict with no unknown_error and SQL extraction succeeds.
Default model: openai/gpt-oss-20b (already cached locally).
Runtime: ~60s on a single RTX PRO 6000 Blackwell.
Usage:
pytest tests/test_e2e_llm.py -m e2e_llm -v -s
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Ryan Lempka <rlempka@nvidia.com>
- Add test_e2e_chain.py: canonical NeMo-Gym chain test using ng_run +
ng_collect_rollouts subprocesses to exercise the full server topology
(HeadServer + resource server + agent + vLLM model server)
- Move vllm_url and llm_model fixtures to conftest.py (session-scoped,
shared across test_e2e_llm.py and test_e2e_chain.py)
- Fix license: "CC BY-SA 4.0" -> "Creative Commons Attribution-ShareAlike
4.0 International" (ng_run validates against an allowlist)
Note: run via the server venv directly to avoid Ray's uv runtime-env hook:
resources_servers/spider2_lite/.venv/bin/pytest \
resources_servers/spider2_lite/tests/test_e2e_chain.py -m e2e_llm -v -s
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Ryan Lempka <rlempka@nvidia.com>
Public benchmarks use HuggingFace (huggingface_identifier), NVIDIA-internal benchmarks use GitLab (gitlab_identifier). GitLab infrastructure is not publicly accessible. Verified field names and CLI args against source. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Ryan Lempka <rlempka@nvidia.com>
Remove machine-specific fallback path from conftest.py vllm binary lookup. The e2e tests already skip when vllm is not found on PATH. Also fix the None guard in the vllm_url fixture is_file() check. Update SKILL.md reward profiling section with clearer model selection guidance, variance requirements, and sanity check expectations. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Ryan Lempka <rlempka@nvidia.com>
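A PATH-only binary lookup with a None guard might look like the sketch below. The function name `find_binary` is hypothetical and this is not the conftest.py code; it only shows why the guard is needed, since `shutil.which` returns None when nothing is found and `Path(None)` would raise.

```python
import shutil
from pathlib import Path


def find_binary(name="vllm"):
    """Return the binary's path if it is on PATH, otherwise None."""
    found = shutil.which(name)
    if found is None:
        # Guard before constructing a Path: Path(None) raises TypeError.
        return None
    path = Path(found)
    return path if path.is_file() else None


missing = find_binary("no-such-binary-for-this-sketch-12345")
```

A fixture built on a helper like this can skip the e2e tests when the result is None, with no machine-specific fallback path.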
Add spider2_lite_sqlite_validation dataset entry (135 tasks) with gitlab_identifier for download. The validation JSONL has been generated and uploaded separately. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Ryan Lempka <rlempka@nvidia.com>
Reverts the public/internal dataset split added in 7c6849a. Spider 2.0-Lite uses GitLab dataset registry. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Ryan Lempka <rlempka@nvidia.com>
… conftest Signed-off-by: Ryan Lempka <rlempka@nvidia.com>
Signed-off-by: Ryan Lempka <rlempka@nvidia.com>
Signed-off-by: Ryan Lempka <rlempka@nvidia.com>
Fix import ordering (I001), remove unused queue import (F401), rename ambiguous variable l -> line (E741), apply ruff format. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Ryan Lempka <rlempka@nvidia.com>
Rebases onto main (picks up #757 which enabled all environments in docs) and runs update_resource_servers.py to add spider2_lite to the README table. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Ryan Lempka <rlempka@nvidia.com>
cd2f022 to 36028c0
…tion time Wraps _load_oracle_tasks() in try/except so pytest collection succeeds in CI where the spider2-lite reference repo is not present. e2e tests are still skipped via pytest.mark.e2e when not in the right environment. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Ryan Lempka <rlempka@nvidia.com>
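A collection-safe loader along these lines is sketched below. It assumes the oracle tasks are stored as JSONL, which the commit message does not state, and `load_oracle_tasks_safe` is a hypothetical name rather than the real `_load_oracle_tasks()`.

```python
import json
from pathlib import Path


def load_oracle_tasks_safe(path):
    """Load tasks from a JSONL file, returning [] if it cannot be read.

    Returning an empty list instead of raising keeps module import, and
    therefore pytest collection, working when the reference repo is absent.
    """
    try:
        with Path(path).open(encoding="utf-8") as f:
            return [json.loads(line) for line in f if line.strip()]
    except (OSError, json.JSONDecodeError):
        return []


tasks = load_oracle_tasks_safe("/nonexistent/spider2-lite/tasks.jsonl")
```

When the resulting list is passed to `pytest.mark.parametrize`, an empty parameter set is reported as skipped under pytest's default settings, so collection does not error.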
E2E tests require the Spider2 reference repo and/or a local vLLM server, neither of which are available in CI. Correctness is documented via the oracle test results in the PR. Prune dead fixtures from conftest.py. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Ryan Lempka <rlempka@nvidia.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Ryan Lempka <rlempka@nvidia.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Ryan Lempka <rlempka@nvidia.com>
Summary
Adds a resource server for the Spider 2.0-Lite benchmark, covering the 135 local SQLite tasks.
- evaluate.py algorithm
- gold_sql (execute gold and predicted SQL, compare results) and gold_result (compare predicted SQL execution against pre-computed gold result sets from the Spider2 repo)
- setup_spider2.py
- ng_prepare_data flow verified
Scoring notes
This server covers the 135 SQLite tasks from Spider 2.0-Lite. Published leaderboard scores cover all 547 tasks (BigQuery + Snowflake + SQLite) using a multi-turn agent, so no direct comparison is possible. Scorer correctness is validated by the oracle test: all 24 tasks with public gold SQL return reward=1.0. DeepSeek-R1's published score on the full benchmark is ~13.7%; our SQLite-subset pass@1 of 12.6% is consistent.
Reward profiling
Single-turn agent (spider2_lite_simple_agent), 135 tasks, temperature=1.0, max_output_tokens=16384.
pass@1 = mean reward per attempt. pass@k = fraction of tasks solved at least once in k attempts.
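The two metrics as defined above can be computed as in this sketch. The rewards dictionary is made-up illustrative data, not results from this PR, and the function names are hypothetical.

```python
def pass_at_1(rewards_by_task):
    """Mean reward over every attempt of every task."""
    rewards = [r for attempts in rewards_by_task.values() for r in attempts]
    return sum(rewards) / len(rewards)


def pass_at_k(rewards_by_task, k):
    """Fraction of tasks solved at least once within the first k attempts."""
    solved = sum(
        1
        for attempts in rewards_by_task.values()
        if any(r >= 1.0 for r in attempts[:k])
    )
    return solved / len(rewards_by_task)


# Hypothetical binary rewards for three tasks with five attempts each.
example = {
    "task_a": [1.0, 0.0, 0.0, 0.0, 0.0],
    "task_b": [0.0, 0.0, 0.0, 0.0, 0.0],
    "task_c": [1.0, 1.0, 1.0, 1.0, 1.0],
}
p1 = pass_at_1(example)      # 6 successes / 15 attempts = 0.4
p5 = pass_at_k(example, 5)   # 2 of 3 tasks solved at least once
```

With binary rewards, pass@1 is the per-attempt success rate, so it can never exceed pass@k for k >= 1 over the same rollouts.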
Rollouts collected with:
ng_collect_rollouts +agent_name=spider2_lite_simple_agent \
    +input_jsonl_fpath=resources_servers/spider2_lite/data/spider2_lite_sqlite_validation.jsonl \
    +output_jsonl_fpath=results/rollouts.jsonl \
    +num_repeats=5 \
    "+responses_create_params={max_output_tokens: 16384, temperature: 1.0}"
Test plan
- TestEvalUtils, TestExtractSqlReuse, TestSpider2LiteServerUnit, TestSetupSpider2
- ng_prepare_data +mode=example_validation passes