Conversation
📝 WalkthroughWalkthroughAdds a no-op cleanup hook to CommandGenStrategy, implements NIXL-specific artifact removal, invokes cleanup from Slurm runners on job completion, and adds tests verifying invocation and filesystem cleanup. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Poem
🚥 Pre-merge checks | ✅ 2✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
Greptile SummaryThis PR adds post-job cleanup of NIXL container-mount directories ( Key changes:
Confidence Score: 4/5
Important Files Changed
|
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/cloudai/systems/slurm/single_sbatch_runner.py`:
- Around line 224-225: The method completed_test_runs in single_sbatch_runner.py
currently accepts a parameter named job that is unused (triggering ARG002);
rename the parameter from job to _job in the completed_test_runs signature to
mark it intentionally unused and update any references inside the method (there
are none) so lint stops complaining—keep the body returning list(self.all_trs)
unchanged.
In `@src/cloudai/workloads/common/nixl.py`:
- Around line 235-251: The cleanup currently calls resolve() in
_cleanup_targets() which lets rmtree follow symlinks outside the run dir;
instead, return non-resolved targets (keep (self.test_run.output_path /
"filepath_mount") and "device_list_mounts") from _cleanup_targets(), and in
cleanup_job_artifacts() for each target first check existence, then compute
resolved_root = self.test_run.output_path.resolve() and target_resolved =
target.resolve(); only allow deletion if
target_resolved.is_relative_to(resolved_root); handle symlinks safely by
unlinking when target.is_symlink() (or target.is_file()) and using shutil.rmtree
for real directories, otherwise skip and log/refuse to delete to avoid escaping
the run directory (referencing cleanup_job_artifacts, _cleanup_targets,
test_run.output_path, filepath_mount, device_list_mounts).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 6e86174c-28f1-4102-bfb6-d14268f624d8
📒 Files selected for processing (8)
src/cloudai/_core/command_gen_strategy.pysrc/cloudai/systems/slurm/single_sbatch_runner.pysrc/cloudai/systems/slurm/slurm_runner.pysrc/cloudai/workloads/common/nixl.pytests/test_get_job_id.pytests/test_single_sbatch_runner.pytests/workloads/nixl_bench/test_command_gen_strategy_slurm.pytests/workloads/nixl_kvbench/test_command_gen_slurm.py
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
There was a problem hiding this comment.
♻️ Duplicate comments (1)
src/cloudai/workloads/common/nixl.py (1)
235-250:⚠️ Potential issue | 🔴 CriticalPrevent cleanup from deleting paths outside the run directory.
At Line 246 and Line 250, calling
.resolve()before deletion lets symlinks escapeself.test_run.output_path; then Line 238 can recursively delete arbitrary directories. Keep unresolved targets and enforce an output-root boundary before deletion.🔒 Proposed safe cleanup patch
def cleanup_job_artifacts(self) -> None: + output_root = self.test_run.output_path.resolve() for cleanup_target in self._cleanup_targets(): - if cleanup_target.exists(): - shutil.rmtree(cleanup_target) - logging.debug(f"Cleaned up job artifact: {cleanup_target}") + if not cleanup_target.exists() and not cleanup_target.is_symlink(): + continue + + if cleanup_target.is_symlink() or cleanup_target.is_file(): + logging.warning("Skipping non-directory cleanup target: %s", cleanup_target) + cleanup_target.unlink(missing_ok=True) + continue + + resolved_target = cleanup_target.resolve() + if resolved_target != output_root and output_root not in resolved_target.parents: + logging.warning("Skipping cleanup target outside output path: %s", cleanup_target) + continue + + shutil.rmtree(cleanup_target) + logging.debug("Cleaned up job artifact: %s", cleanup_target) def _cleanup_targets(self) -> list[Path]: cleanup_targets: list[Path] = [] filepath_raw: str | None = cast(str | None, self.test_run.test.cmd_args_dict.get("filepath")) if filepath_raw: - cleanup_targets.append((self.test_run.output_path / "filepath_mount").resolve()) + cleanup_targets.append(self.test_run.output_path / "filepath_mount") device_list_raw: str | None = cast(str | None, self.test_run.test.cmd_args_dict.get("device_list")) if device_list_raw and get_files_from_device_list(device_list_raw): - cleanup_targets.append((self.test_run.output_path / "device_list_mounts").resolve()) + cleanup_targets.append(self.test_run.output_path / "device_list_mounts")🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/cloudai/workloads/common/nixl.py` around lines 235 - 250, The cleanup currently calls .resolve() in _cleanup_targets which can let symlinks escape the intended run directory; update _cleanup_targets (used by cleanup_job_artifacts) to stop resolving targets (return Path objects rooted under self.test_run.output_path without calling .resolve()), and then in cleanup_job_artifacts before calling shutil.rmtree verify each candidate is contained inside the run output root (compare a resolved output root like self.test_run.output_path.resolve() with the candidate's absolute path via a safe common-path/is_relative_to check using resolve(strict=False) or os.path.commonpath) and skip/delete only if the candidate is inside that boundary; this prevents deleting paths outside the run directory while keeping targets unresolved until validated.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@src/cloudai/workloads/common/nixl.py`:
- Around line 235-250: The cleanup currently calls .resolve() in
_cleanup_targets which can let symlinks escape the intended run directory;
update _cleanup_targets (used by cleanup_job_artifacts) to stop resolving
targets (return Path objects rooted under self.test_run.output_path without
calling .resolve()), and then in cleanup_job_artifacts before calling
shutil.rmtree verify each candidate is contained inside the run output root
(compare a resolved output root like self.test_run.output_path.resolve() with
the candidate's absolute path via a safe common-path/is_relative_to check using
resolve(strict=False) or os.path.commonpath) and skip/delete only if the
candidate is inside that boundary; this prevents deleting paths outside the run
directory while keeping targets unresolved until validated.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 47c06ae3-4e1b-4a53-b5da-c8efe8993cc9
📒 Files selected for processing (1)
src/cloudai/workloads/common/nixl.py
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/cloudai/workloads/common/nixl.py`:
- Around line 235-239: Static analysis flags the f-string in
cleanup_job_artifacts; inside the loop that iterates over
self._cleanup_targets() and calls shutil.rmtree on cleanup_target (after
cleanup_target.is_dir()), replace the f-string logging.debug(f"Cleaned up job
artifact: {cleanup_target}") with lazy formatting logging.debug("Cleaned up job
artifact: %s", cleanup_target) to avoid eager string interpolation and keep
consistent logging style.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 496ba64f-0519-417c-bb1b-b2f34fa7cc4d
📒 Files selected for processing (1)
src/cloudai/workloads/common/nixl.py
Summary
Cleanup mounted files to nixl after execution to free space
Test Plan
Additional Notes
Additional functionality for CLOUDAI-11