gpu-operator on talos 1.9 alpha #527

Open
Hexoplon opened this issue Nov 19, 2024 · 1 comment

@Hexoplon

I'm running the v24.9.0 release of the NVIDIA GPU Operator and attempted to upgrade my nodes from Talos 1.8.2 to 1.9.0-alpha.2. However, the operator is now unable to find and validate the drivers. I previously had to make some custom modifications to the operator validator logic to make it search under /glibc/lib, relative to the driverInstallDir, but these no longer help either.

These are the driverInstallDir values I have tried, with no success:

  • /run/nvidia/driver (the default one from Nvidia)
  • /usr/local
  • /usr/local/glibc
  • /usr/local/glibc/usr

From browsing the Talos filesystem, as far as I can tell, nvidia-smi and other executables are located in /usr/local/bin, while all the libraries are now located under /usr/lib/glibc/lib64 and symlinked to a few other places as well.

As the NVIDIA components do not search under glibc by default, I cannot see which value of driverInstallDir would currently allow these components to find both the libraries and the required binaries. (For an example of the discovery logic, see the GPU Operator validator at https://github.com/NVIDIA/gpu-operator/blob/79b1240221f22bbbc60c6c4b659aace48f0b3f42/validator/find.go#L35; the discovery of the binaries is a few lines below.)
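
To make the mismatch easier to see, here is a rough sketch that checks each candidate driverInstallDir for both a driver library and the nvidia-smi binary. It is not the validator's code: the relative search directories, the `libnvidia-ml.so.*` pattern, and the helper names `find_first` and `check_root` are assumptions for illustration, so the lists should be adjusted to match the validator version in use. It also assumes the node filesystem is visible at these paths from wherever the script runs.

```python
import glob
import os

# Assumed relative search directories, for illustration only. The real lists
# live in the validator source and may differ between versions.
ASSUMED_LIB_DIRS = ["usr/lib64", "usr/lib/x86_64-linux-gnu", "lib64"]
ASSUMED_BIN_DIRS = ["usr/bin", "usr/sbin", "bin", "sbin"]


def find_first(root, rel_dirs, pattern):
    """Return the first path under root/<rel_dir> matching pattern, or None."""
    for rel in rel_dirs:
        matches = sorted(glob.glob(os.path.join(root, rel, pattern)))
        if matches:
            return matches[0]
    return None


def check_root(root, lib_dirs=ASSUMED_LIB_DIRS, bin_dirs=ASSUMED_BIN_DIRS):
    """Report what a candidate driverInstallDir exposes under the assumed dirs."""
    return {
        "root": root,
        # Library name pattern is an assumption, not taken from the validator.
        "library": find_first(root, lib_dirs, "libnvidia-ml.so.*"),
        "binary": find_first(root, bin_dirs, "nvidia-smi"),
    }


if __name__ == "__main__":
    # The candidate values tried in this issue.
    candidates = [
        "/run/nvidia/driver",
        "/usr/local",
        "/usr/local/glibc",
        "/usr/local/glibc/usr",
    ]
    for candidate in candidates:
        result = check_root(candidate)
        usable = result["library"] is not None and result["binary"] is not None
        print(f"{candidate}: library={result['library']} "
              f"binary={result['binary']} usable={usable}")
```

A root is only useful if both lookups succeed under the same prefix, which is the condition that appears to fail when the binaries and the libraries live in unrelated trees.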

From the description of c7eb377, it seemed like it should "just work" now with the gpu operator. Any pointers as to what I might be doing wrong?

@Hexoplon

I see now why my earlier modifications no longer work: the libraries are no longer found in /usr/local/glibc/lib, because this folder is not symlinked like the others.
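
For anyone wanting to confirm the same thing on their own nodes, a small sketch like the following reports whether a directory exists, whether it is a symlink, and where it resolves. The helper name `describe_dir` is hypothetical, and the two paths are simply the ones mentioned in this issue; it assumes the node filesystem is visible at those paths from wherever it runs.

```python
import os


def describe_dir(path):
    """Summarize whether path is a directory, a symlink, and where it resolves."""
    return {
        "path": path,
        "exists": os.path.isdir(path),      # follows symlinks
        "is_symlink": os.path.islink(path),
        "resolved": os.path.realpath(path),
    }


if __name__ == "__main__":
    # Paths taken from this issue; adjust for other layouts.
    for path in ["/usr/local/glibc/lib", "/usr/lib/glibc/lib64"]:
        info = describe_dir(path)
        print(f"{info['path']}: exists={info['exists']} "
              f"symlink={info['is_symlink']} -> {info['resolved']}")
```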
