Dynamic Resource Allocation (DRA) for NVIDIA GPUs in Kubernetes

This DRA resource driver is currently under active development and not yet designed for production use. We may (at times) decide to push commits over main until we have something more stable. Use at your own risk.

A document and demo of the DRA support for GPUs provided by this repo can be found below:

Document	Demo

Demo

This section describes using kind to demo the functionality of the NVIDIA GPU DRA Driver.

First, since we'll launch kind with GPU support, ensure that the following prerequisites are met:

kind is installed. See the official documentation here.
The NVIDIA Container Toolkit is installed on your system. This can be done by following the instructions here.
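
As a quick sanity check before continuing, the sketch below reports which of the command-line tools used in this demo are not on your PATH. The helper name missing_tools is made up for illustration, and the tool list is an assumption based on the commands that appear later in this README.

```shell
#!/bin/sh
# Print the name of every command in the argument list that is not on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Tools invoked later in this demo; prints nothing when all are present.
missing_tools kind docker nvidia-ctk nvidia-smi kubectl
```

An empty result only means the binaries were found; it does not verify their versions or configuration.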

Configure the NVIDIA Container Runtime as the default Docker runtime:

sudo nvidia-ctk runtime configure --runtime=docker --set-as-default

Restart Docker to apply the changes:
```
sudo systemctl restart docker
```
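
To confirm the runtime change took effect, one option is to inspect Docker's daemon configuration. The sketch below assumes the configuration lives at /etc/docker/daemon.json and that the change was recorded as a "default-runtime": "nvidia" entry; the helper name default_runtime_is_nvidia is hypothetical.

```shell
#!/bin/sh
# Succeed if the given Docker daemon config names "nvidia" as the default runtime.
default_runtime_is_nvidia() {
    grep -Eq '"default-runtime"[[:space:]]*:[[:space:]]*"nvidia"' "$1"
}

# Assumed config location; the check is skipped when the file is not readable.
daemon_json=/etc/docker/daemon.json
if [ -r "$daemon_json" ]; then
    if default_runtime_is_nvidia "$daemon_json"; then
        echo "default runtime: nvidia"
    else
        echo "default runtime is not nvidia"
    fi
fi
```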
Configure the NVIDIA Container Runtime to use volume mounts to select the devices to inject into a container. To do so, set the accept-nvidia-visible-devices-as-volume-mounts option to true in the /etc/nvidia-container-runtime/config.toml file:
```
sudo nvidia-ctk config --in-place --set accept-nvidia-visible-devices-as-volume-mounts=true
```
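
If you want to double-check that the option was written, a minimal sketch is to look for an uncommented line that sets it to true. The helper name volume_mounts_enabled is hypothetical, and the pattern assumes the usual key = value layout of a TOML file.

```shell
#!/bin/sh
# Succeed if the file has an uncommented line setting the option to true.
volume_mounts_enabled() {
    grep -Eq '^[[:space:]]*accept-nvidia-visible-devices-as-volume-mounts[[:space:]]*=[[:space:]]*true' "$1"
}

# Path taken from the step above; skipped when the file is not readable.
config=/etc/nvidia-container-runtime/config.toml
if [ -r "$config" ]; then
    if volume_mounts_enabled "$config"; then
        echo "volume-mount device selection is enabled"
    else
        echo "volume-mount device selection is NOT enabled"
    fi
fi
```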
Show the current set of GPUs on the machine:
```
nvidia-smi -L
```
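
The examples later in this demo depend on how many GPUs are available, so it can help to count them up front. The sketch below counts the lines of nvidia-smi -L output that start with "GPU <index>:", which is the format shown in the sample logs further down; the helper name count_gpus is made up for illustration.

```shell
#!/bin/sh
# Count lines of the form "GPU <index>: ..." read from standard input.
# Indented sub-device lines (if any) are not counted.
count_gpus() {
    grep -c '^GPU [0-9][0-9]*:' || true
}

# Only query the driver when nvidia-smi is actually installed.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi -L | count_gpus
fi
```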

Start by cloning this repository and changing into its directory. All of the scripts and example Pod specs used in this demo are in the demo subdirectory, so take a moment to browse through the various files and see what's available:

git clone https://github.com/NVIDIA/k8s-dra-driver.git

cd k8s-dra-driver

Setting up the infrastructure

Here's a demo showing how to install and configure DRA, and run a pod in a kind cluster on a Linux workstation.

Below are the detailed, step-by-step instructions.

First, create a kind cluster to run the demo:

./demo/clusters/kind/create-cluster.sh

From here we will build the image for the example resource driver:

./demo/clusters/kind/build-dra-driver.sh

This also makes the built images available to the kind cluster.

We now install the NVIDIA GPU DRA driver:

./demo/clusters/kind/install-dra-driver.sh

Once the installation completes, two pods should be running in the nvidia namespace:

kubectl get pods -n nvidia

NAME                                                          READY   STATUS    RESTARTS   AGE
nvidia-dra-driver-k8s-dra-driver-controller-844fcb94b-ktbkc   1/1     Running   0          69s
nvidia-dra-driver-k8s-dra-driver-kubelet-plugin-5vfp9         1/1     Running   0          69s
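
The pods may take a moment to reach the Running state. As a rough sketch, the helper below (not_running, a hypothetical name) reads the table printed by kubectl get pods and lists any pod whose STATUS column is not Running. It assumes the five-column layout shown above, where STATUS is the third column.

```shell
#!/bin/sh
# Read `kubectl get pods` output on stdin and print the names of pods
# whose STATUS (third column) is anything other than Running.
not_running() {
    awk 'NR > 1 && $3 != "Running" { print $1 }'
}

# Only query the cluster when kubectl is installed.
# Alternatively, block until ready with:
#   kubectl wait --for=condition=Ready pod --all -n nvidia --timeout=120s
if command -v kubectl >/dev/null 2>&1; then
    kubectl get pods -n nvidia | not_running
fi
```

No output means every listed pod reported Running.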

Run the examples by following the steps in the demo script

Finally, you can run the various examples contained in the demo/specs/quickstart folder. With the most recent updates for Kubernetes v1.31, only the first 3 examples in this folder are currently functional.

You can run them as follows:

kubectl apply --filename=demo/specs/quickstart/gpu-test{1,2,3}.yaml
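
The {1,2,3} in the command above is brace expansion, which is performed by shells such as bash and zsh rather than by kubectl. If your shell does not support it, a plain loop is one way to get the same file list. This is only a sketch; quickstart_specs is a hypothetical helper name.

```shell
#!/bin/sh
# Print the three quickstart spec paths, one per line.
quickstart_specs() {
    for n in 1 2 3; do
        printf 'demo/specs/quickstart/gpu-test%s.yaml\n' "$n"
    done
}

# Apply each spec in turn, but only from the repository root with kubectl installed.
if command -v kubectl >/dev/null 2>&1 && [ -d demo/specs/quickstart ]; then
    quickstart_specs | while read -r spec; do
        kubectl apply --filename="$spec"
    done
fi
```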

Next, check the pods' statuses and logs. Depending on which GPUs are available, running the first three examples will produce output similar to the listings in the rest of this section.

Note: there is a known issue with kind. While tailing the log of a running pod in the kind cluster, you may see the error failed to create fsnotify watcher: too many open files. Increasing the value of fs.inotify.max_user_watches may resolve it.
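
A minimal sketch for checking that limit is shown below. The target of 524288 is an assumption (a commonly suggested value, not one taken from this repository), and below_limit is a hypothetical helper; the script only prints a suggestion and changes nothing itself.

```shell
#!/bin/sh
# Succeed if the current limit ($1) is below the wanted one ($2); both are integers.
below_limit() {
    [ "$1" -lt "$2" ]
}

# Assumed target value; adjust to suit your system.
wanted=524288
limit_file=/proc/sys/fs/inotify/max_user_watches
if [ -r "$limit_file" ] && below_limit "$(cat "$limit_file")" "$wanted"; then
    echo "consider: sudo sysctl fs.inotify.max_user_watches=$wanted"
fi
```

A value set with sysctl this way does not persist across reboots.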

kubectl get pod -A -l app=pod

NAMESPACE           NAME                                       READY   STATUS    RESTARTS   AGE
gpu-test1           pod1                                       1/1     Running   0          34s
gpu-test1           pod2                                       1/1     Running   0          34s
gpu-test2           pod                                        2/2     Running   0          34s
gpu-test3           pod1                                       1/1     Running   0          34s
gpu-test3           pod2                                       1/1     Running   0          34s

kubectl logs -n gpu-test1 -l app=pod

GPU 0: A100-SXM4-40GB (UUID: GPU-662077db-fa3f-0d8f-9502-21ab0ef058a2)
GPU 0: A100-SXM4-40GB (UUID: GPU-4cf8db2d-06c0-7d70-1a51-e59b25b2c16c)

kubectl logs -n gpu-test2 pod --all-containers

GPU 0: A100-SXM4-40GB (UUID: GPU-79a2ba02-a537-ccbf-2965-8e9d90c0bd54)
GPU 0: A100-SXM4-40GB (UUID: GPU-79a2ba02-a537-ccbf-2965-8e9d90c0bd54)

kubectl logs -n gpu-test3 -l app=pod

GPU 0: A100-SXM4-40GB (UUID: GPU-4404041a-04cf-1ccf-9e70-f139a9b1e23c)
GPU 0: A100-SXM4-40GB (UUID: GPU-4404041a-04cf-1ccf-9e70-f139a9b1e23c)
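
In the sample output above, gpu-test1 shows two different UUIDs, while gpu-test2 and gpu-test3 each show the same UUID twice, which suggests that the latter two share a single GPU. To check this on your own cluster without comparing UUIDs by eye, the sketch below counts the distinct UUIDs in a log listing; unique_gpu_uuids is a hypothetical helper name.

```shell
#!/bin/sh
# Read `nvidia-smi -L` style lines on stdin and print how many distinct
# GPU UUIDs they contain.
unique_gpu_uuids() {
    sed -n 's/.*(UUID: \(GPU-[0-9a-f-]*\)).*/\1/p' | sort -u | wc -l | tr -d ' '
}

# Only query the cluster when kubectl is installed.
if command -v kubectl >/dev/null 2>&1; then
    kubectl logs -n gpu-test1 -l app=pod | unique_gpu_uuids
fi
```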

Cleaning up the environment

Remove the cluster created in the preceding steps:

./demo/clusters/kind/delete-cluster.sh
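
To confirm the cluster is gone, you can look for its name in the output of kind get clusters. In the sketch below, my-dra-cluster is a placeholder, not the name the demo scripts actually use, and cluster_listed is a hypothetical helper; substitute the real cluster name.

```shell
#!/bin/sh
# Succeed if the cluster name ($1) appears as a whole line on stdin.
cluster_listed() {
    grep -Fqx -- "$1"
}

# Placeholder name; replace with the cluster created by the demo scripts.
name=my-dra-cluster
if command -v kind >/dev/null 2>&1; then
    if kind get clusters 2>/dev/null | cluster_listed "$name"; then
        echo "cluster $name is still present"
    else
        echo "cluster $name is not listed"
    fi
fi
```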
