docs: add GPU Operator production troubleshooting runbook#2198

Open
framsouza wants to merge 2 commits into NVIDIA:main from framsouza:gpu-troubleshooting-runbook
@framsouza

This PR adds a production troubleshooting runbook for common operational issues I encountered when running the NVIDIA GPU Operator in Kubernetes clusters.

The runbook includes step-by-step debugging guidance for situations such as:

  • GPUs not appearing on Kubernetes nodes
  • GPU workloads stuck in Pending
  • driver daemonset failures
  • device plugin initialization errors
  • missing DCGM exporter metrics
  • MIG configuration issues

Each section includes example commands and mock outputs to help operators quickly identify common failure patterns.
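
As a rough illustration of the first scenario (GPUs not appearing on nodes), the check can be sketched in Python. This is a minimal sketch and not taken from the runbook itself: it assumes the node list has the shape of `kubectl get nodes -o json` output and that GPUs are advertised under the `nvidia.com/gpu` extended resource. The function name `nodes_without_gpus` and the sample node names are hypothetical.

```python
GPU_RESOURCE = "nvidia.com/gpu"  # extended resource advertised by the NVIDIA device plugin


def nodes_without_gpus(node_list: dict) -> list[str]:
    """Return names of nodes whose allocatable GPU count is missing or zero."""
    missing = []
    for node in node_list.get("items", []):
        allocatable = node.get("status", {}).get("allocatable", {})
        # Kubernetes reports resource quantities as strings, e.g. "8".
        if int(allocatable.get(GPU_RESOURCE, "0")) == 0:
            missing.append(node["metadata"]["name"])
    return missing


# Hypothetical sample data standing in for `kubectl get nodes -o json` output.
sample = {
    "items": [
        {"metadata": {"name": "gpu-node-a"},
         "status": {"allocatable": {"cpu": "64", "nvidia.com/gpu": "8"}}},
        {"metadata": {"name": "gpu-node-b"},
         "status": {"allocatable": {"cpu": "64"}}},
    ]
}

print(nodes_without_gpus(sample))
```

Against a live cluster, the same function could be fed `json.load(sys.stdin)` with `kubectl get nodes -o json` piped in. A node listed here is only a starting point; the driver and device plugin pods on that node would still need to be inspected.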

GPU clusters can be challenging to troubleshoot when components such as drivers, container runtime integration, or the device plugin fail to initialize correctly.

This runbook aims to provide a structured debugging flow that platform engineers and SREs can follow during incidents. It may also be useful to anyone curious about the operator.

The content is based on past troubleshooting experiences operating GPU workloads on Kubernetes clusters using the GPU Operator.

@copy-pr-bot
copy-pr-bot bot commented Mar 8, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@framsouza framsouza force-pushed the gpu-troubleshooting-runbook branch from 2075c4f to 6986bcc Compare March 8, 2026 19:13
The guide covers scenarios such as:
- GPUs not detected on nodes
- GPU workloads stuck in Pending
- driver daemonset failures
- device plugin initialization issues
- missing DCGM exporter metrics
- MIG configuration problems

It also includes example debugging commands and sample outputs to help
operators quickly recognize common failure patterns.

The content is based on past operational troubleshooting experiences running
GPU workloads on Kubernetes clusters using the GPU Operator.

Signed-off-by: framsouza <fram.souza14@gmail.com>
@framsouza framsouza force-pushed the gpu-troubleshooting-runbook branch from b34e577 to 4832201 Compare March 11, 2026 09:49