
feat: add spider2_lite resource server #769

Open
ryan-lempka wants to merge 18 commits into main from rlempka/spider2-lite-sqlite

ryan-lempka commented Feb 26, 2026

Summary

Adds a resource server for the Spider 2.0-Lite benchmark, covering the 135 local SQLite tasks.

  • Execution-based text-to-SQL evaluation with binary reward (1.0 = correct, 0.0 = incorrect)
  • Result-set comparison uses column-vector matching, mirroring the official evaluate.py algorithm
  • Two verification modes: gold_sql (execute gold and predicted SQL, compare results) and gold_result (compare predicted SQL execution against pre-computed gold result sets from the Spider2 repo)
  • SQLite databases auto-downloaded from Google Drive on first startup via setup_spider2.py
  • Validation dataset (135 tasks) uploaded to GitLab registry; ng_prepare_data flow verified
  • No LLM judge required
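The gold_sql mode and column-vector matching described above can be sketched as follows. This is a minimal illustration, not the PR's eval_utils code: the function names, the numeric tolerance, and the per-column sorting used for order-insensitive comparison are all assumptions made for the example.

```python
import math
import sqlite3


def _values_match(a, b, tol=1e-2):
    """Compare two cells: numbers within a tolerance, everything else exactly."""
    if a is None or b is None:
        return a is None and b is None
    if isinstance(a, (int, float)) and isinstance(b, (int, float)):
        return math.isclose(a, b, abs_tol=tol)
    return a == b


def _sort_key(v):
    """Total order over mixed cell types: numbers, then other values, then NULLs."""
    if v is None:
        return (2, 0.0, "")
    if isinstance(v, (int, float)):
        return (0, float(v), "")
    return (1, 0.0, str(v))


def columns_match(gold_rows, pred_rows, ignore_order=True):
    """True if every gold column appears as some column of the prediction."""
    if len(gold_rows) != len(pred_rows):
        return False
    gold_cols = [list(c) for c in zip(*gold_rows)]
    pred_cols = [list(c) for c in zip(*pred_rows)]
    if ignore_order:
        # Assumption: each column is sorted independently, so row alignment
        # across columns is not checked in this sketch.
        gold_cols = [sorted(c, key=_sort_key) for c in gold_cols]
        pred_cols = [sorted(c, key=_sort_key) for c in pred_cols]
    return all(
        any(all(_values_match(g, p) for g, p in zip(gc, pc)) for pc in pred_cols)
        for gc in gold_cols
    )


def reward_gold_sql(conn, gold_sql, pred_sql):
    """Binary reward: execute both queries and compare their result sets."""
    gold_rows = conn.execute(gold_sql).fetchall()
    try:
        pred_rows = conn.execute(pred_sql).fetchall()
    except sqlite3.Error:
        return 0.0  # predicted SQL that fails to execute scores zero
    return 1.0 if columns_match(gold_rows, pred_rows) else 0.0


# Toy in-memory database standing in for a Spider2 SQLite file.
conn = sqlite3.connect(":memory:")
conn.executescript(
    "CREATE TABLE t(name TEXT, score REAL);"
    "INSERT INTO t VALUES ('a', 1.0), ('b', 2.5);"
)
gold = "SELECT name, score FROM t"
print(reward_gold_sql(conn, gold, "SELECT score, name, 1 AS extra FROM t ORDER BY score DESC"))
print(reward_gold_sql(conn, gold, "SELECT name FROM t"))
```

In this sketch, extra predicted columns and a different column or row order do not affect the reward, while a missing gold column or a query that fails to run yields 0.0.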

Scoring notes

This server covers the 135 SQLite tasks from Spider 2.0-Lite. Published leaderboard scores cover all 547 tasks (BigQuery + Snowflake + SQLite) and use a multi-turn agent, so they are not directly comparable. Scorer correctness is instead validated by the oracle test: all 24 tasks with public gold SQL return reward=1.0. As a rough sanity check, DeepSeek-R1's published score on the full benchmark is ~13.7%, and the SQLite-subset pass@1 of 12.6% for DeepSeek-R1-Distill-Qwen-32B is in the same range.

Reward profiling

Single-turn agent (spider2_lite_simple_agent), 135 tasks, temperature=1.0, max_output_tokens=16384.
pass@1 = mean reward per attempt. pass@k = fraction of tasks solved at least once in k attempts.
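The two definitions above can be written out as a short sketch. The input layout (a mapping from task id to a list of binary rewards, one per attempt) is an assumption for illustration and is not the rollout JSONL schema.

```python
def pass_at_1(rewards_by_task):
    """Mean reward over every attempt of every task."""
    attempts = [r for rewards in rewards_by_task.values() for r in rewards]
    return sum(attempts) / len(attempts)


def pass_at_k(rewards_by_task):
    """Fraction of tasks solved at least once across their k attempts."""
    solved = sum(1 for rewards in rewards_by_task.values() if any(r == 1.0 for r in rewards))
    return solved / len(rewards_by_task)


# Three hypothetical tasks with k=3 attempts each.
example = {
    "task_a": [1.0, 0.0, 0.0],
    "task_b": [0.0, 0.0, 0.0],
    "task_c": [1.0, 1.0, 0.0],
}
print(f"pass@1 = {pass_at_1(example):.3f}")  # 3 correct attempts out of 9
print(f"pass@3 = {pass_at_k(example):.3f}")  # 2 of 3 tasks solved at least once
```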

| Model | Repeats | pass@1 | pass@k | pass@1 variance |
|---|---|---|---|---|
| DeepSeek-R1-Distill-Qwen-32B | 5 | 12.6% | 33.3% (k=5) | +/-1.28% |
| Nemotron-3-Nano-30B-A3B | 5 | 20.1% | 36.3% (k=5) | +/-1.54% |
| Qwen3-30B-A3B-Thinking-2507 | 5 | 21.6% | 33.3% (k=5) | +/-1.58% |
| Qwen3-30B-A3B-Thinking-2507 (extended) | 13 | 22.4% | 41.5% (k=13) | +/-1.00% |
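The +/- figures in the last column are numerically consistent with a binomial standard error computed over all attempts (135 tasks times the number of repeats). That reading is an inference from the reported numbers, not something the PR states, and the helper name below is hypothetical.

```python
import math

NUM_TASKS = 135  # SQLite tasks in the validation set


def binomial_standard_error(pass_at_1, num_repeats, num_tasks=NUM_TASKS):
    """Standard error of a proportion estimated from num_tasks * num_repeats attempts."""
    n = num_tasks * num_repeats
    return math.sqrt(pass_at_1 * (1.0 - pass_at_1) / n)


rows = [
    ("DeepSeek-R1-Distill-Qwen-32B", 0.126, 5),
    ("Nemotron-3-Nano-30B-A3B", 0.201, 5),
    ("Qwen3-30B-A3B-Thinking-2507", 0.216, 5),
    ("Qwen3-30B-A3B-Thinking-2507 (extended)", 0.224, 13),
]
for name, p, repeats in rows:
    se = binomial_standard_error(p, repeats)
    print(f"{name}: +/-{se * 100:.2f}%")
# Prints +/-1.28%, +/-1.54%, +/-1.58%, +/-1.00%, matching the table.
```

Note that this treats attempts as independent, which ignores correlation between repeats of the same task.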

Rollouts collected with:

ng_collect_rollouts +agent_name=spider2_lite_simple_agent \
    +input_jsonl_fpath=resources_servers/spider2_lite/data/spider2_lite_sqlite_validation.jsonl \
    +output_jsonl_fpath=results/rollouts.jsonl \
    +num_repeats=5 "+responses_create_params={max_output_tokens: 16384, temperature: 1.0}"

Test plan

  • 46 unit tests passing (TestEvalUtils, TestExtractSqlReuse, TestSpider2LiteServerUnit, TestSetupSpider2)
  • Pre-commit clean (ruff, format, add-verified-flag, update-readme-table)
  • Oracle test: all 24 tasks with public gold SQL return reward=1.0
  • ng_prepare_data +mode=example_validation passes
  • Validation dataset (135 tasks) uploaded to GitLab registry and download verified
  • Reward profiling complete on 3 models, pass@1 variance <1% on Qwen3 (13 repeats)


copy-pr-bot bot commented Feb 26, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.


pyproject.toml Outdated
"docs: mark tests related to documentation (deselect with '-m \"not docs\"')",
"skipduringci: marks tests that are skipped during ci as they are addressed by Jenkins jobs but should be run to test user setups",
"pleasefixme: marks tests that are broken and need fixing",
"e2e: end-to-end tests requiring downloaded databases and Spider2 reference repo",
Contributor

Can these be pushed down into the spider2_lite requirements.txt instead?

Author

@jfarris-nvidia yes, these are mainly for my own pipe cleaning. They can be removed from the final PR if deemed unnecessary.

Author

Fixed

ryan-lempka and others added 8 commits March 2, 2026 20:22
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Ryan Lempka <rlempka@nvidia.com>
Execution-based text-to-SQL evaluation on Spider 2.0-Lite (135 SQLite tasks).
Binary reward via result-set column-vector matching against gold SQL or
pre-computed gold result CSVs. No LLM judge required.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Ryan Lempka <rlempka@nvidia.com>
…n eval_utils

- compare_result_sets: filter condition_cols indices that exceed the number
  of gold SQL result columns (CSV-indexed condition_cols can exceed SQL col count)
- compare_multi_result_sets: remove n > 1 guard so flat int lists are treated
  as shared condition_cols for all gold sets regardless of gold set count
- execute_and_compare: route through compare_multi_result_sets to correctly
  normalize list-of-lists condition_cols for single-gold comparisons
- Add e2e oracle test suite (tests/test_e2e_oracle.py) and e2e pytest marker

All 24 gold SQL oracle tasks now pass (was 18/24).
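The condition_cols handling described in this commit can be sketched roughly as below. The helper names are hypothetical and the semantics (an empty list meaning "compare all columns", zero-based indices) are assumptions; the actual functions in eval_utils may differ.

```python
def normalize_condition_cols(condition_cols, num_gold_sets):
    """Return one list of column indices per gold result set.

    A flat list of ints is treated as shared by every gold set, regardless of
    how many gold sets there are; a list of lists is kept as one list per set.
    """
    if not condition_cols:
        return [[] for _ in range(num_gold_sets)]
    if all(isinstance(c, int) for c in condition_cols):
        return [list(condition_cols) for _ in range(num_gold_sets)]
    return [list(cols) for cols in condition_cols]


def filter_condition_cols(cols, num_gold_columns):
    """Drop indices that point past the last column of the gold SQL result."""
    return [c for c in cols if 0 <= c < num_gold_columns]


# A flat list is shared even when there is only a single gold set.
print(normalize_condition_cols([0, 2], 1))         # [[0, 2]]
print(normalize_condition_cols([[0], [1, 2]], 2))  # [[0], [1, 2]]
# CSV-derived index 5 exceeds a 3-column SQL result and is filtered out.
print(filter_condition_cols([0, 2, 5], 3))         # [0, 2]
```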

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Ryan Lempka <rlempka@nvidia.com>
Starts a local vLLM server (or uses SPIDER2_LLM_URL if pre-running),
calls the model for each of the 5 example tasks, and verifies the
extracted SQL through the Spider2-Lite resource server. Each task
is a separate parametrized test that asserts the pipeline produces
a valid verdict with no unknown_error and SQL extraction succeeds.

Default model: openai/gpt-oss-20b (already cached locally).
Runtime: ~60s on a single RTX PRO 6000 Blackwell.

Usage: pytest tests/test_e2e_llm.py -m e2e_llm -v -s

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Ryan Lempka <rlempka@nvidia.com>
- Add test_e2e_chain.py: canonical NeMo-Gym chain test using ng_run +
  ng_collect_rollouts subprocesses to exercise the full server topology
  (HeadServer + resource server + agent + vLLM model server)
- Move vllm_url and llm_model fixtures to conftest.py (session-scoped,
  shared across test_e2e_llm.py and test_e2e_chain.py)
- Fix license: "CC BY-SA 4.0" -> "Creative Commons Attribution-ShareAlike
  4.0 International" (ng_run validates against an allowlist)

Note: run via the server venv directly to avoid Ray's uv runtime-env hook:
  resources_servers/spider2_lite/.venv/bin/pytest \
    resources_servers/spider2_lite/tests/test_e2e_chain.py -m e2e_llm -v -s

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Ryan Lempka <rlempka@nvidia.com>
Public benchmarks use HuggingFace (huggingface_identifier), NVIDIA-internal
benchmarks use GitLab (gitlab_identifier). GitLab infrastructure is not
publicly accessible. Verified field names and CLI args against source.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Ryan Lempka <rlempka@nvidia.com>
Remove machine-specific fallback path from conftest.py vllm binary lookup.
The e2e tests already skip when vllm is not found on PATH.
Also fix the None guard in the vllm_url fixture is_file() check.
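A PATH-only lookup with a None guard before is_file() might look like the sketch below. The function name is hypothetical and this is not the conftest.py code; it only illustrates why the guard is needed, since shutil.which returns None when the binary is absent.

```python
import shutil
from pathlib import Path
from typing import Optional


def find_vllm_binary(name: str = "vllm") -> Optional[Path]:
    """Locate the binary on PATH only, with no machine-specific fallback."""
    found = shutil.which(name)
    if found is None:
        # Guard first: Path(None) would raise TypeError before is_file() runs.
        return None
    path = Path(found)
    return path if path.is_file() else None


binary = find_vllm_binary()
if binary is None:
    print("vllm not found on PATH; e2e tests would be skipped")
else:
    print(f"using vllm at {binary}")
```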

Update SKILL.md reward profiling section with clearer model selection
guidance, variance requirements, and sanity check expectations.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Ryan Lempka <rlempka@nvidia.com>
Add spider2_lite_sqlite_validation dataset entry (135 tasks) with
gitlab_identifier for download. The validation JSONL has been generated
and uploaded separately.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Ryan Lempka <rlempka@nvidia.com>
ryan-lempka and others added 6 commits March 2, 2026 20:22
Reverts the public/internal dataset split added in 7c6849a.
Spider 2.0-Lite uses GitLab dataset registry.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Ryan Lempka <rlempka@nvidia.com>
… conftest

Signed-off-by: Ryan Lempka <rlempka@nvidia.com>
Signed-off-by: Ryan Lempka <rlempka@nvidia.com>
Signed-off-by: Ryan Lempka <rlempka@nvidia.com>
Fix import ordering (I001), remove unused queue import (F401),
rename ambiguous variable l -> line (E741), apply ruff format.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Ryan Lempka <rlempka@nvidia.com>
Rebases onto main (picks up #757 which enabled all environments in docs)
and runs update_resource_servers.py to add spider2_lite to the README table.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Ryan Lempka <rlempka@nvidia.com>
…tion time

Wraps _load_oracle_tasks() in try/except so pytest collection succeeds in CI
where the spider2-lite reference repo is not present. e2e tests are still
skipped via pytest.mark.e2e when not in the right environment.
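One way to keep collection from failing is to have the loader return an empty list when the reference data is missing, as in the sketch below. The function name, the JSONL file format, and the path are assumptions for illustration; the PR's _load_oracle_tasks() may read a different layout.

```python
import json
from pathlib import Path


def load_oracle_tasks_safe(path):
    """Load oracle tasks, returning [] instead of raising when data is absent.

    Intended to run at import/collection time, where an exception would abort
    test collection for the whole module.
    """
    try:
        with Path(path).open(encoding="utf-8") as handle:
            return [json.loads(line) for line in handle if line.strip()]
    except (OSError, json.JSONDecodeError):
        return []


# Hypothetical location; in CI the reference repo is not checked out.
tasks = load_oracle_tasks_safe("spider2-lite/missing_oracle_tasks.jsonl")
print(len(tasks))
```

Passing an empty list to pytest.mark.parametrize makes pytest report the test as skipped by default, so collection still succeeds.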

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Ryan Lempka <rlempka@nvidia.com>
E2E tests require the Spider2 reference repo and/or a local vLLM server,
neither of which is available in CI. Correctness is documented via the
oracle test results in the PR. Prune dead fixtures from conftest.py.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Ryan Lempka <rlempka@nvidia.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Ryan Lempka <rlempka@nvidia.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Ryan Lempka <rlempka@nvidia.com>
2 participants