99 changes: 88 additions & 11 deletions gpu-operator/cdi.rst
@@ -16,13 +16,15 @@

.. headings # #, * *, =, -, ^, "

############################################################
Container Device Interface (CDI) Support in the GPU Operator
############################################################
#################################################################################
Container Device Interface (CDI) and Node Resource Interface (NRI) Plugin Support
#################################################################################

************************************
About the Container Device Interface
************************************
This page gives an overview of CDI and NRI Plugin support in the GPU Operator.

**************************************
About Container Device Interface (CDI)
**************************************

The `Container Device Interface (CDI) <https://github.com/cncf-tags/container-device-interface/blob/main/SPEC.md>`_
is an open specification for container runtimes that abstracts what access to a device, such as an NVIDIA GPU, means,
@@ -31,7 +33,7 @@ ensure that a device is available in a container. CDI simplifies adding support
the specification is applicable to all container runtimes that support CDI.

Starting with GPU Operator v25.10.0, CDI is used by default for enabling GPU support in containers running on Kubernetes.
Specifically, CDI support in container runtimes, e.g. containerd and cri-o, is used to inject GPU(s) into workload
Specifically, CDI support in container runtimes, like containerd and cri-o, is used to inject GPU(s) into workload
containers. This differs from prior GPU Operator releases where CDI was used via a CDI-enabled ``nvidia`` runtime class.

If you are upgrading from a GPU Operator release earlier than v25.10.0, where CDI was disabled by default, to v25.10.0 or later, where CDI is enabled by default, no configuration changes are required for standard workloads that use GPU allocation through the Device Plugin.
@@ -51,16 +53,15 @@ Examples of GPU Management Containers include monitoring agents and device plugi

It is recommended that ``NVIDIA_VISIBLE_DEVICES`` only be used by management containers.

********************************
Enabling CDI During Installation
********************************
************
Enabling CDI
************

CDI is enabled by default during installation in GPU Operator v25.10.0 and later.
Follow the instructions for installing the Operator with Helm on the :doc:`getting-started` page.

CDI is also enabled by default during a Helm upgrade to GPU Operator v25.10.0 and later.

*******************************
Enabling CDI After Installation
*******************************

@@ -138,3 +139,79 @@ disable CDI and use the legacy NVIDIA Container Toolkit stack instead with the f
nvidia.com/gpu.deploy.operator-validator=true \
nvidia.com/gpu.present=true \
--overwrite


.. _nri-plugin:

**********************************************
About the Node Resource Interface (NRI) Plugin
**********************************************

Node Resource Interface (NRI) is a standardized interface for plugging extensions, called NRI Plugins, into OCI-compatible container runtimes like CRI-O and containerd.
NRI Plugins serve as hooks that intercept pod and container lifecycle events and perform functions such as injecting devices into a container and applying topology-aware placement strategies.
For more details on NRI, refer to the `NRI overview <https://github.com/containerd/nri/tree/main?tab=readme-ov-file#background>`_ in the containerd repository.

When enabled in the GPU Operator, the NRI Plugin is managed by the NVIDIA Container Toolkit and provides an alternative to the ``nvidia`` runtime class for provisioning GPU workload pods.
It allows the GPU Operator to extend the container runtime behavior without modifying the container runtime itself.
This feature also simplifies deployments on platforms like k3s, k0s, or RKE, because the GPU Operator no longer needs you to set values like ``CONTAINERD_CONFIG``, ``CONTAINERD_SOCKET``, or ``RUNTIME_CONFIG_SOURCE`` for the Container Toolkit.

***********************
Enabling the NRI Plugin
***********************

The NRI Plugin requires the following:

- CDI to be enabled in the GPU Operator.

- CRI-O v1.34.0 or later, or containerd v1.7.30, v2.1.x, or v2.2.x.
  If you are not using the latest containerd version, check that both CDI and NRI are enabled in the containerd configuration file before deploying the GPU Operator.
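
As a rough check, assuming shell access to the node and that ``containerd`` is on the ``PATH``, you can dump the merged containerd configuration and look for the CDI and NRI settings.
The key names in this sketch (``enable_cdi`` in the CRI plugin section and ``disable`` in the ``io.containerd.nri.v1.nri`` section) are assumptions based on common containerd configuration layouts; confirm them against the documentation for your containerd version.

.. code-block:: console

   $ sudo containerd config dump | grep -E -A 3 'enable_cdi|io\.containerd\.nri\.v1\.nri'

Under these assumptions, CDI is enabled when ``enable_cdi`` is ``true`` and NRI is enabled when ``disable`` is ``false`` in the NRI section.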

To enable the NRI Plugin during installation, follow the instructions for installing the Operator with Helm on the :doc:`getting-started` page and include the ``--set cdi.nriPluginEnabled=true`` argument in your Helm command.
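
For illustration, an installation command with the NRI Plugin enabled might look like the following sketch.
It assumes that the NVIDIA Helm repository has been added under the alias ``nvidia`` and that the Operator is installed into the ``gpu-operator`` namespace; the :doc:`getting-started` page remains the reference for the full command and any version pin.

.. code-block:: console

   $ helm install --wait --generate-name \
       -n gpu-operator --create-namespace \
       nvidia/gpu-operator \
       --set cdi.nriPluginEnabled=true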

Enabling the NRI Plugin After Installation
******************************************

#. Enable the NRI Plugin by modifying the cluster policy:

   .. code-block:: console

      $ kubectl patch clusterpolicies.nvidia.com/cluster-policy --type='json' \
          -p='[{"op": "replace", "path": "/spec/cdi/nriPluginEnabled", "value":true}]'

   *Example Output*

   .. code-block:: output

      clusterpolicy.nvidia.com/cluster-policy patched

#. (Optional) Confirm that the container toolkit and device plugin pods restart:

   .. code-block:: console

      $ kubectl get pods -n gpu-operator

   *Example Output*

   .. literalinclude:: ./manifests/output/nri-get-pods-restart.txt
      :language: output
      :emphasize-lines: 6,9
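
To check the setting itself rather than the pod restarts, you can read the field back from the cluster policy.
This sketch assumes the same resource name as the patch command and derives the JSONPath expression from the patch path ``/spec/cdi/nriPluginEnabled``:

.. code-block:: console

   $ kubectl get clusterpolicies.nvidia.com/cluster-policy \
       -o jsonpath='{.spec.cdi.nriPluginEnabled}'

The command prints the current value of the field, which can also be used to confirm the change after disabling the plugin.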


************************
Disabling the NRI Plugin
************************

To disable the NRI Plugin and use the ``nvidia`` runtime class instead, modify the cluster policy:

.. code-block:: console

   $ kubectl patch clusterpolicies.nvidia.com/cluster-policy --type='json' \
       -p='[{"op": "replace", "path": "/spec/cdi/nriPluginEnabled", "value":false}]'

*Example Output*

.. code-block:: output

   clusterpolicy.nvidia.com/cluster-policy patched
226 changes: 0 additions & 226 deletions gpu-operator/conf.py

This file was deleted.
