docs: add GPU Operator production troubleshooting runbook by framsouza · Pull Request #2198 · NVIDIA/gpu-operator

framsouza · 2026-03-08T16:34:30Z

This PR adds a production troubleshooting runbook for common operational issues I encountered when running the NVIDIA GPU Operator in Kubernetes clusters.

The runbook includes step-by-step debugging guidance for situations such as:

- GPUs not appearing on Kubernetes nodes
- GPU workloads stuck in Pending
- driver daemonset failures
- device plugin initialization errors
- missing DCGM exporter metrics
- MIG configuration issues

Each section includes example commands and mock outputs to help operators quickly identify common failure patterns.
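The runbook's own commands are not reproduced in this description. As a rough illustration of the first scenario (GPUs not appearing on nodes), the sketch below lists nodes that report no allocatable `nvidia.com/gpu` resource. It assumes `kubectl` is on the PATH and configured for the target cluster; the function names `nodes_without_gpus` and `fetch_nodes` are hypothetical and not taken from the PR.

```python
import json
import subprocess

# Extended resource name advertised by the NVIDIA device plugin.
GPU_RESOURCE = "nvidia.com/gpu"


def nodes_without_gpus(node_list: dict) -> list[str]:
    """Return names of nodes whose allocatable GPU count is missing or zero.

    `node_list` is the parsed output of `kubectl get nodes -o json`.
    """
    missing = []
    for node in node_list.get("items", []):
        allocatable = node.get("status", {}).get("allocatable", {})
        # Quantities are strings in the API; a missing key means no GPUs advertised.
        if int(allocatable.get(GPU_RESOURCE, "0")) == 0:
            missing.append(node["metadata"]["name"])
    return missing


def fetch_nodes() -> dict:
    """Fetch the node list from the current kubectl context (assumed configured)."""
    result = subprocess.run(
        ["kubectl", "get", "nodes", "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)


# Example against a live cluster:
#   for name in nodes_without_gpus(fetch_nodes()):
#       print(f"{name}: no allocatable {GPU_RESOURCE}")
```

On a mixed cluster this also reports CPU-only nodes, so in practice the result would likely be narrowed to nodes that are expected to carry GPUs.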

GPU clusters can be challenging to troubleshoot when components such as drivers, container runtime integration, or the device plugin fail to initialize correctly.

This runbook aims to provide a structured debugging flow that platform engineers and SREs can follow during incidents. It may also be useful to anyone who is curious about the operator.

The content is based on past troubleshooting experiences operating GPU workloads on Kubernetes clusters using the GPU Operator.

copy-pr-bot · 2026-03-08T16:34:34Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

The guide covers scenarios such as:

- GPUs not detected on nodes
- GPU workloads stuck in Pending
- driver daemonset failures
- device plugin initialization issues
- missing DCGM exporter metrics
- MIG configuration problems

It also includes example debugging commands and sample outputs to help operators quickly recognize common failure patterns.

The content is based on past operational troubleshooting experiences running GPU workloads on Kubernetes clusters using the GPU Operator.

Signed-off-by: framsouza <fram.souza14@gmail.com>

framsouza requested review from cdesiniotis, karthikvetrivel, rahulait, rajathagasthya, shivamerla and tariq1890 as code owners March 8, 2026 16:34

framsouza force-pushed the gpu-troubleshooting-runbook branch from 2075c4f to 6986bcc Compare March 8, 2026 19:13

framsouza added 2 commits March 11, 2026 10:49

docs: adding MIG debug inspired on NVIDIA/issues/2155

4832201

Signed-off-by: framsouza <fram.souza14@gmail.com>

framsouza force-pushed the gpu-troubleshooting-runbook branch from b34e577 to 4832201 Compare March 11, 2026 09:49

framsouza wants to merge 2 commits into NVIDIA:main from framsouza:gpu-troubleshooting-runbook
