feat: add spider2_lite resource server by ryan-lempka · Pull Request #769 · NVIDIA-NeMo/Gym

ryan-lempka · 2026-02-26T16:59:47Z

Summary

Adds a resource server for the Spider 2.0-Lite benchmark, covering the 135 local SQLite tasks.

Execution-based text-to-SQL evaluation with binary reward (1.0 = correct, 0.0 = incorrect)
Result-set comparison uses column-vector matching, mirroring the official evaluate.py algorithm
Two verification modes: gold_sql (execute gold and predicted SQL, compare results) and gold_result (compare predicted SQL execution against pre-computed gold result sets from the Spider2 repo)
SQLite databases auto-downloaded from Google Drive on first startup via setup_spider2.py
Validation dataset (135 tasks) uploaded to GitLab registry; ng_prepare_data flow verified
No LLM judge required
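
A minimal sketch of the column-vector matching idea described above, assuming result sets are lists of row tuples. It is not the code from evaluate.py or from this PR: the function names, the 1e-2 numeric tolerance, and the row-count guard are illustrative assumptions.

```python
import math


def cells_equal(a, b, tol=1e-2):
    """Numbers match within an absolute tolerance (assumed value); others must be equal."""
    if isinstance(a, (int, float)) and isinstance(b, (int, float)):
        return math.isclose(a, b, abs_tol=tol)
    return a == b


def vectors_equal(v1, v2, ignore_order=False):
    """Compare two column vectors cell by cell, optionally ignoring row order."""
    if len(v1) != len(v2):
        return False
    if ignore_order:
        sort_key = lambda x: (x is None, str(x))
        v1, v2 = sorted(v1, key=sort_key), sorted(v2, key=sort_key)
    return all(cells_equal(a, b) for a, b in zip(v1, v2))


def columns_match(pred_rows, gold_rows, ignore_order=False):
    """Binary reward: 1.0 if every gold column appears among the predicted columns."""
    if len(pred_rows) != len(gold_rows):
        return 0.0
    # Transpose row tuples into column vectors.
    gold_cols = [list(col) for col in zip(*gold_rows)]
    pred_cols = [list(col) for col in zip(*pred_rows)]
    matched = all(
        any(vectors_equal(g, p, ignore_order) for p in pred_cols)
        for g in gold_cols
    )
    return 1.0 if matched else 0.0
```

Under this scheme a prediction may return extra columns, or columns in a different order, and still score 1.0 as long as each gold column is present.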

Scoring notes

This server covers the 135 SQLite tasks from Spider 2.0-Lite. Published leaderboard scores cover all 547 tasks (BigQuery + Snowflake + SQLite) using a multi-turn agent, so the numbers below are not directly comparable to the leaderboard. Scorer correctness is validated by the oracle test: all 24 tasks with public gold SQL return reward=1.0. As a rough sanity check only, DeepSeek-R1's published score on the full benchmark is ~13.7%, and our SQLite-subset pass@1 of 12.6% with DeepSeek-R1-Distill-Qwen-32B is in the same range.
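
The oracle test mentioned above can be pictured as feeding each gold SQL back in as the prediction and expecting reward 1.0. The sketch below uses Python's sqlite3 module against a toy in-memory database. The helper names and the demo table are hypothetical, and exact row equality stands in for the real column-vector comparison.

```python
import sqlite3


def make_demo_db():
    """Build a tiny in-memory database (illustrative data only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany("INSERT INTO t (id, name) VALUES (?, ?)", [(1, "a"), (2, "b")])
    return conn


def run_query(conn, sql):
    """Execute SQL and return all rows, or None if the statement fails."""
    try:
        return conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None


def reward_gold_sql(conn, predicted_sql, gold_sql):
    """gold_sql mode: run both queries and compare the results."""
    pred = run_query(conn, predicted_sql)
    gold = run_query(conn, gold_sql)
    if pred is None or gold is None:
        return 0.0
    # Simplification: exact row equality instead of column-vector matching.
    return 1.0 if pred == gold else 0.0


def oracle_check(conn, gold_sqls):
    """A correct scorer must give reward 1.0 when the prediction is the gold SQL."""
    return all(reward_gold_sql(conn, sql, sql) == 1.0 for sql in gold_sqls)
```

An oracle failure points at the scorer or the data rather than the model, which is why it is a useful check before reward profiling.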

Reward profiling

Single-turn agent (spider2_lite_simple_agent), 135 tasks, temperature=1.0, max_output_tokens=16384.
pass@1 = mean reward per attempt. pass@k = fraction of tasks solved at least once in k attempts.

Model	Repeats	pass@1	pass@k	pass@1 variance
DeepSeek-R1-Distill-Qwen-32B	5	12.6%	33.3% (k=5)	+/-1.28%
Nemotron-3-Nano-30B-A3B	5	20.1%	36.3% (k=5)	+/-1.54%
Qwen3-30B-A3B-Thinking-2507	5	21.6%	33.3% (k=5)	+/-1.58%
Qwen3-30B-A3B-Thinking-2507 (extended)	13	22.4%	41.5% (k=13)	+/-1.00%
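
The pass@1 and pass@k definitions above can be sketched as follows, assuming rewards are stored as one list of binary rewards per task. The function names are illustrative. The +/- column in the table is numerically consistent with a binomial standard error sqrt(p(1-p)/N) with N = tasks x repeats (for example p = 0.126, N = 135 x 5 gives about 1.28%); that reading is an inference from the numbers, not something the PR states.

```python
import math


def pass_at_1(rewards):
    """Mean reward over every attempt of every task."""
    attempts = [r for task in rewards for r in task]
    return sum(attempts) / len(attempts)


def pass_at_k(rewards):
    """Fraction of tasks solved at least once across their k attempts."""
    solved = sum(1 for task in rewards if any(r == 1.0 for r in task))
    return solved / len(rewards)


def binomial_std_error(p, n_attempts):
    """Standard error of a proportion p estimated from n_attempts trials."""
    return math.sqrt(p * (1.0 - p) / n_attempts)


# Toy data: 4 tasks, 3 attempts each (not taken from the PR).
toy_rewards = [
    [1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 0.0],
]
print(pass_at_1(toy_rewards))  # 3 correct attempts out of 12
print(pass_at_k(toy_rewards))  # 2 of 4 tasks solved at least once
```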

Rollouts collected with:

ng_collect_rollouts +agent_name=spider2_lite_simple_agent \
    +input_jsonl_fpath=resources_servers/spider2_lite/data/spider2_lite_sqlite_validation.jsonl \
    +output_jsonl_fpath=results/rollouts.jsonl \
    +num_repeats=5 "+responses_create_params={max_output_tokens: 16384, temperature: 1.0}"

Test plan

46 unit tests passing (TestEvalUtils, TestExtractSqlReuse, TestSpider2LiteServerUnit, TestSetupSpider2)
Pre-commit clean (ruff, format, add-verified-flag, update-readme-table)
Oracle test: all 24 tasks with public gold SQL return reward=1.0
ng_prepare_data +mode=example_validation passes
Validation dataset (135 tasks) uploaded to GitLab registry and download verified
Reward profiling complete on 3 models, pass@1 variance <1% on Qwen3 (13 repeats)

copy-pr-bot · 2026-02-26T16:59:51Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

jfarris-nvidia · 2026-02-26T19:33:58Z

pyproject.toml

    "docs: mark tests related to documentation (deselect with '-m \"not docs\"')",
    "skipduringci: marks tests that are skipped during ci as they are addressed by Jenkins jobs but should be run to test user setups",
    "pleasefixme: marks tests that are broken and need fixing",
+    "e2e: end-to-end tests requiring downloaded databases and Spider2 reference repo",


Can these be pushed down into the spider2_lite requirements.txt instead?

@jfarris-nvidia yes. These are mainly for my own pipe cleaning and can be removed from the final PR if deemed unnecessary.
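
For context on moving the marker out of the root pyproject.toml: pytest also lets a conftest.py register markers through the pytest_configure hook. The snippet below is a sketch of that general mechanism and an assumption about how a server-local registration could look, not the code that landed in this PR.

```python
# conftest.py (hypothetical location: the spider2_lite tests directory)


def pytest_configure(config):
    """Register the e2e marker locally instead of in the root pyproject.toml."""
    config.addinivalue_line(
        "markers",
        "e2e: end-to-end tests requiring downloaded databases and Spider2 reference repo",
    )
```

With the marker registered this way, `-m "not e2e"` deselects those tests without any change to repository-wide configuration.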

Execution-based text-to-SQL evaluation on Spider 2.0-Lite (135 SQLite tasks). Binary reward via result-set column-vector matching against gold SQL or pre-computed gold result CSVs. No LLM judge required. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Ryan Lempka <rlempka@nvidia.com>

…n eval_utils
- compare_result_sets: filter condition_cols indices that exceed the number of gold SQL result columns (CSV-indexed condition_cols can exceed SQL col count)
- compare_multi_result_sets: remove n > 1 guard so flat int lists are treated as shared condition_cols for all gold sets regardless of gold set count
- execute_and_compare: route through compare_multi_result_sets to correctly normalize list-of-lists condition_cols for single-gold comparisons
- Add e2e oracle test suite (tests/test_e2e_oracle.py) and e2e pytest marker
All 24 gold SQL oracle tasks now pass (was 18/24).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Ryan Lempka <rlempka@nvidia.com>
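
The condition_cols handling described in this commit message can be illustrated with two small helpers. Both names are hypothetical and the bodies are a sketch of the described behaviour (flat int lists shared across gold sets, out-of-range indices dropped), not the eval_utils implementation. Treating an empty condition_cols as "no restriction" for every gold set is also an assumption.

```python
def normalize_condition_cols(condition_cols, num_gold_sets):
    """Return one list of column indices per gold result set."""
    if not condition_cols:
        return [[] for _ in range(num_gold_sets)]
    if all(isinstance(c, int) for c in condition_cols):
        # Flat int list: shared by every gold set, regardless of how many there are.
        return [list(condition_cols) for _ in range(num_gold_sets)]
    # Already a list of lists: one entry per gold set.
    return [list(cols) for cols in condition_cols]


def filter_condition_cols(condition_cols, num_gold_columns):
    """Drop indices that fall outside the gold SQL result's column range."""
    return [c for c in condition_cols if 0 <= c < num_gold_columns]


# A CSV-indexed list can point past the columns the gold SQL actually returns.
print(filter_condition_cols([0, 3, 5], num_gold_columns=4))
print(normalize_condition_cols([0, 2], num_gold_sets=1))
```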

Starts a local vLLM server (or uses SPIDER2_LLM_URL if pre-running), calls the model for each of the 5 example tasks, and verifies the extracted SQL through the Spider2-Lite resource server. Each task is a separate parametrized test that asserts the pipeline produces a valid verdict with no unknown_error and SQL extraction succeeds.
Default model: openai/gpt-oss-20b (already cached locally).
Runtime: ~60s on a single RTX PRO 6000 Blackwell.
Usage:
    pytest tests/test_e2e_llm.py -m e2e_llm -v -s
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Ryan Lempka <rlempka@nvidia.com>

- Add test_e2e_chain.py: canonical NeMo-Gym chain test using ng_run + ng_collect_rollouts subprocesses to exercise the full server topology (HeadServer + resource server + agent + vLLM model server)
- Move vllm_url and llm_model fixtures to conftest.py (session-scoped, shared across test_e2e_llm.py and test_e2e_chain.py)
- Fix license: "CC BY-SA 4.0" -> "Creative Commons Attribution-ShareAlike 4.0 International" (ng_run validates against an allowlist)
Note: run via the server venv directly to avoid Ray's uv runtime-env hook:
    resources_servers/spider2_lite/.venv/bin/pytest \
        resources_servers/spider2_lite/tests/test_e2e_chain.py -m e2e_llm -v -s
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Ryan Lempka <rlempka@nvidia.com>

Public benchmarks use HuggingFace (huggingface_identifier), NVIDIA-internal benchmarks use GitLab (gitlab_identifier). GitLab infrastructure is not publicly accessible. Verified field names and CLI args against source. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Ryan Lempka <rlempka@nvidia.com>

Remove machine-specific fallback path from conftest.py vllm binary lookup. The e2e tests already skip when vllm is not found on PATH. Also fix the None guard in the vllm_url fixture is_file() check. Update SKILL.md reward profiling section with clearer model selection guidance, variance requirements, and sanity check expectations. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Ryan Lempka <rlempka@nvidia.com>

Add spider2_lite_sqlite_validation dataset entry (135 tasks) with gitlab_identifier for download. The validation JSONL has been generated and uploaded separately. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Ryan Lempka <rlempka@nvidia.com>

Rebases onto main (picks up #757 which enabled all environments in docs) and runs update_resource_servers.py to add spider2_lite to the README table. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Ryan Lempka <rlempka@nvidia.com>

…tion time Wraps _load_oracle_tasks() in try/except so pytest collection succeeds in CI where the spider2-lite reference repo is not present. e2e tests are still skipped via pytest.mark.e2e when not in the right environment. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Ryan Lempka <rlempka@nvidia.com>

E2E tests require the Spider2 reference repo and/or a local vLLM server, neither of which are available in CI. Correctness is documented via the oracle test results in the PR. Prune dead fixtures from conftest.py. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Ryan Lempka <rlempka@nvidia.com>

jfarris-nvidia reviewed Feb 26, 2026

ryan-lempka self-assigned this Mar 3, 2026

ryan-lempka marked this pull request as ready for review March 3, 2026 04:12

ryan-lempka and others added 8 commits March 2, 2026 20:22

chore: add spider2_lite skeleton with gitignore

c73710a

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Ryan Lempka <rlempka@nvidia.com>

ryan-lempka and others added 6 commits March 2, 2026 20:22

revert: restore CLAUDE.md dataset storage guidance to prior state

34477b5

Reverts the public/internal dataset split added in 7c6849a. Spider 2.0-Lite uses GitLab dataset registry. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Ryan Lempka <rlempka@nvidia.com>

fix: move e2e pytest markers from root pyproject.toml to spider2_lite…

c3c057d

… conftest Signed-off-by: Ryan Lempka <rlempka@nvidia.com>

revert: restore SKILL.md to main

1b84294

Signed-off-by: Ryan Lempka <rlempka@nvidia.com>

style: apply ruff-format to conftest.py

2794053

Signed-off-by: Ryan Lempka <rlempka@nvidia.com>

fix: fix ruff lint errors in e2e test files

e330b8f

Fix import ordering (I001), remove unused queue import (F401), rename ambiguous variable l -> line (E741), apply ruff format. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Ryan Lempka <rlempka@nvidia.com>

ryan-lempka force-pushed the rlempka/spider2-lite-sqlite branch from cd2f022 to 36028c0 March 3, 2026 04:22

feat: add example_rollouts.jsonl for spider2_lite data validation

017406b

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Ryan Lempka <rlempka@nvidia.com>

fix: commit example_metrics.json for data validation CI check

c3fdabc

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Ryan Lempka <rlempka@nvidia.com>

ryan-lempka requested review from bxyu-nvidia and dhruvnathawani March 3, 2026 05:19

ryan-lempka wants to merge 18 commits into main from rlempka/spider2-lite-sqlite

2 participants
