Avoid real system calls by amaslenn · Pull Request #842 · NVIDIA/cloudai

amaslenn · 2026-03-20T10:31:09Z

Summary

Avoid real system calls. Most of them were related to supports_gpu_directives_cache == None. It slows down tests on systems with slurm controller and in agents' sandbox.

Test Plan

CI

Additional Notes

–

coderabbitai · 2026-03-20T10:31:24Z

📝 Walkthrough

Walkthrough

Added GPU-directives cache initialization to the SlurmSystem pytest fixture and updated SlurmSystem tests: refactored one test to avoid a fixture, adjusted mocks, initialized cache state in a test, and added a new test for probe-error fallback behavior and logging.

Changes

Cohort / File(s)	Summary
Fixture change `tests/conftest.py`	Set `supports_gpu_directives_cache = True` on the `slurm_system` pytest fixture.
Slurm system tests `tests/systems/slurm/test_system.py`	Removed use of the `grouped_nodes` fixture in one test, replaced with local empty mapping and mocks for `SlurmSystem.update` and `SlurmSystem.group_nodes_by_state`; reset `supports_gpu_directives_cache` before asserting behavior; added `test_supports_gpu_directives_defaults_to_true_on_probe_error` which patches `fetch_command_output`, asserts fallback to `True`, cache set to `True`, warning log content, and that `fetch_command_output` was called with `"scontrol show config"`.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Poem

🐇 I nudged the cache, a tiny tweak,
Tests hopped forward, brave not meek.
When probes grumble and logs complain,
I default True — hop on the train! 🥕✨

🚥 Pre-merge checks | ✅ 2

✅ Passed checks (2 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title 'Avoid real system calls' directly relates to the main objective of the pull request, which is to eliminate real system calls in tests, particularly those related to GPU directives caching.
Description check	✅ Passed	The description explains that the PR avoids real system calls related to `supports_gpu_directives_cache == None`, which matches the changes shown in the raw summary where the fixture is configured with cache defaults and tests are mocked.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch am/slurm

📝 Coding Plan

Generate coding plan for human review comments

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

Tip

CodeRabbit can use your project's `ruff` configuration to improve the quality of Python code reviews.

Add a Ruff configuration file to your project to customize how CodeRabbit runs ruff.

greptile-apps · 2026-03-20T10:33:28Z

Greptile Summary

This PR improves test isolation by eliminating real Slurm system calls that were firing during the test suite — especially on systems with an accessible Slurm controller or inside sandboxed CI agents. The root cause was that supports_gpu_directives_cache defaulted to None on the shared slurm_system fixture, causing the supports_gpu_directives property to trigger an actual scontrol show config invocation, and get_available_nodes_from_group would call update() → sinfo/squeue.

Key changes:

tests/conftest.py: Pre-seeds supports_gpu_directives_cache = True in the shared slurm_system fixture so the cache is always warm; tests that specifically exercise the property logic now explicitly reset it to None.
tests/systems/slurm/test_system.py: Refactors test_get_available_nodes_exceeding_limit_no_callstack to patch SlurmSystem.update and SlurmSystem.group_nodes_by_state, converting it from a test that depended on live Slurm infrastructure into a fully isolated unit test with proper mock assertions. Adds a new test (test_supports_gpu_directives_defaults_to_true_on_probe_error) covering the existing error-handling path that defaults to True when scontrol returns a non-empty stderr. Also removes the legacy typing.Dict/typing.List imports in favour of built-in generic aliases.

Confidence Score: 5/5

This PR is safe to merge — it only modifies test infrastructure with no changes to production code.
All changes are confined to the test suite. The fixture pre-seeding is opt-out (tests that need live cache logic reset it to None explicitly). The new mocks correctly intercept the two methods that previously made real system calls. The new test adds useful coverage for an existing error-handling branch. The only minor concern is an overly permissive kwargs assertion, which is a style issue and not a correctness bug.
No files require special attention.

Important Files Changed

Filename	Overview
tests/conftest.py	Adds `supports_gpu_directives_cache = True` to the shared `slurm_system` fixture to prevent real `scontrol show config` calls in any test that consumes the fixture.
tests/systems/slurm/test_system.py	Refactors `test_get_available_nodes_exceeding_limit_no_callstack` to mock `update` and `group_nodes_by_state`, removing the unused `grouped_nodes` parameter; resets the cache to `None` in the GPU-directive tests that need live property logic; adds a new test for the error-handling path; cleans up legacy `typing` imports.

Sequence Diagram

sequenceDiagram
    participant Test
    participant slurm_system fixture
    participant SlurmSystem
    participant OS (scontrol/sinfo/squeue)

    Note over Test,OS (scontrol/sinfo/squeue): BEFORE this PR
    Test->>slurm_system fixture: request fixture (cache = None)
    Test->>SlurmSystem: get_available_nodes_from_group(...)
    SlurmSystem->>OS (scontrol/sinfo/squeue): sinfo / squeue (real call)
    OS (scontrol/sinfo/squeue)-->>SlurmSystem: result
    Test->>SlurmSystem: supports_gpu_directives (cache miss)
    SlurmSystem->>OS (scontrol/sinfo/squeue): scontrol show config (real call)
    OS (scontrol/sinfo/squeue)-->>SlurmSystem: result

    Note over Test,OS (scontrol/sinfo/squeue): AFTER this PR
    Test->>slurm_system fixture: request fixture (cache = True)
    Test->>SlurmSystem: get_available_nodes_from_group(...)
    SlurmSystem->>SlurmSystem: mock_update (no-op)
    SlurmSystem->>SlurmSystem: mock_group_nodes_by_state (returns empty)
    Test->>SlurmSystem: supports_gpu_directives (cache hit → True)
    Note over OS (scontrol/sinfo/squeue): No real OS calls made

Comments Outside Diff (1)

tests/systems/slurm/test_system.py, line 167-168 (link)

Overly permissive kwargs assertion

group_nodes_by_state currently has the signature (self, partition_name: str, group_name: str) — no exclude_nodes parameter exists. Because the production code calls it with only positional arguments, kwargs will always be {}. Allowing {"exclude_nodes": None} as an alternative makes the assertion vacuously pass any future call that passes exclude_nodes=None, silently masking a regression if the call site changes in an unexpected way.

Tightening this to check for exactly {} would make the test a stricter guard:

If exclude_nodes support is intentionally planned for a future change, a brief comment explaining this would help future readers.

_{Last reviewed commit: "address linting issu..."}

coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

tests/systems/slurm/test_system.py (1)
19-19: ⚠️ Potential issue | 🟡 Minor

Fix CI lint failure by removing unused typing imports.

Line 19 currently keeps Dict/List unused, and CI already reports ruff-check auto-removing it. Please commit that cleanup so pre-commit passes.
Suggested diff
-from typing import Dict, List
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/systems/slurm/test_system.py` at line 19, Remove the unused typing
imports by deleting "Dict" and "List" from the typing import (the line "from
typing import Dict, List"); if no other names are needed from typing, remove the
entire import line, otherwise only keep the used symbols; save and run the
linter (ruff/ruff-check) to confirm the unused-import warning is resolved and
commit the change.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/systems/slurm/test_system.py`:
- Around line 163-169: The test currently calls
slurm_system.get_available_nodes_from_group(partition_name, group_name,
num_nodes) but discards its return; update the test to capture the return value
and assert it equals [] (empty list) to lock down fallback behavior — e.g., call
get_available_nodes_from_group(...) into a variable like result and assert
result == [] while keeping the existing mock and log assertions (refer to
get_available_nodes_from_group, partition_name, group_name, num_nodes).

---

Outside diff comments:
In `@tests/systems/slurm/test_system.py`:
- Line 19: Remove the unused typing imports by deleting "Dict" and "List" from
the typing import (the line "from typing import Dict, List"); if no other names
are needed from typing, remove the entire import line, otherwise only keep the
used symbols; save and run the linter (ruff/ruff-check) to confirm the
unused-import warning is resolved and commit the change.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 45a3e744-8c4b-44bb-9e08-c015ab3b1879

📥 Commits

Reviewing files that changed from the base of the PR and between ef83120 and 9929317.

📒 Files selected for processing (2)

tests/conftest.py
tests/systems/slurm/test_system.py

tests/systems/slurm/test_system.py

coderabbitai

♻️ Duplicate comments (1)

tests/systems/slurm/test_system.py (1)

162-171: ⚠️ Potential issue | 🟡 Minor

Assert fallback return value in the no-callstack test.

Line 162 calls get_available_nodes_from_group(...) but discards the returned value. Capture it and assert [] so the fallback contract is explicitly locked.

Suggested diff

-        slurm_system.get_available_nodes_from_group(partition_name, group_name, num_nodes)
+        nodes = slurm_system.get_available_nodes_from_group(partition_name, group_name, num_nodes)
+        assert nodes == []

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@tests/systems/slurm/test_system.py` around lines 162 - 171, Capture the
return value from the call to get_available_nodes_from_group(partition_name,
group_name, num_nodes) and assert it equals [] to lock the fallback contract;
e.g., assign the call to a variable (e.g., result) immediately after the
invocation and add assert result == [] while keeping the existing mock
assertions and log-message assertion intact.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@tests/systems/slurm/test_system.py`:
- Around line 162-171: Capture the return value from the call to
get_available_nodes_from_group(partition_name, group_name, num_nodes) and assert
it equals [] to lock the fallback contract; e.g., assign the call to a variable
(e.g., result) immediately after the invocation and add assert result == []
while keeping the existing mock assertions and log-message assertion intact.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: f54482bc-7047-4a12-b00d-ad4f2ea0fdf7

📥 Commits

Reviewing files that changed from the base of the PR and between 9929317 and 398c9ff.

📒 Files selected for processing (1)

tests/systems/slurm/test_system.py

Avoid real system calls

9929317

amaslenn requested review from jeffnvidia and srivatsankrishnan as code owners March 20, 2026 10:31

coderabbitai bot reviewed Mar 20, 2026

View reviewed changes

tests/systems/slurm/test_system.py Show resolved Hide resolved

address linting issues

398c9ff

coderabbitai bot reviewed Mar 20, 2026

View reviewed changes

amaslenn requested a review from podkidyshev March 20, 2026 10:43

podkidyshev approved these changes Mar 20, 2026

View reviewed changes

amaslenn merged commit 9916d02 into main Mar 20, 2026
5 checks passed

amaslenn deleted the am/slurm branch March 20, 2026 12:26

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Avoid real system calls#842

Avoid real system calls#842
amaslenn merged 2 commits intomainfrom
am/slurm

amaslenn commented Mar 20, 2026

Uh oh!

coderabbitai bot commented Mar 20, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Poem

Uh oh!

greptile-apps bot commented Mar 20, 2026 •

edited

Loading

Sequence Diagram

Comments Outside Diff (1)

Uh oh!

coderabbitai bot left a comment

Uh oh!

Uh oh!

coderabbitai bot left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

amaslenn commented Mar 20, 2026

Summary

Test Plan

Additional Notes

Uh oh!

coderabbitai bot commented Mar 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Poem

Uh oh!

greptile-apps bot commented Mar 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 5/5

Important Files Changed

Sequence Diagram

Comments Outside Diff (1)

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

coderabbitai bot commented Mar 20, 2026 •

edited

Loading

greptile-apps bot commented Mar 20, 2026 •

edited

Loading