Adding support for HPC clusters #232

Open
myrelle22 opened this issue Sep 13, 2024 · 2 comments

@myrelle22

I have been searching, without success, for a solution that supports high-performance computing clusters. This involves running two or more computing clusters, each with multiple GPUs.
I would need GPU metrics per server and combined per cluster.
There are numerous people looking for a solution for this.

I love this tool for single servers, but it won't work for clusters. I have yet to find a solution for clusters.

Thank you
Andrea

@utkuozdemir
Owner

AFAIK there are places doing exactly that, using this tool in production. Are we talking about Kubernetes clusters? If so, there's a Helm chart you can use: https://github.com/utkuozdemir/helm-charts/tree/master/nvidia-gpu-exporter

The regular setup of kube-prometheus-stack and this chart should work just fine. You need to set serviceMonitor.enabled=true in the values and you should be good to go.

Even if we are not talking about Kubernetes clusters, you can simply install the exporter on each machine and configure Prometheus to scrape all of them. The exporter exposes metrics for each GPU on the machine separately, identified by its UUID. What issue are you facing when using this tool with multiple GPUs and multiple machines?
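
For the non-Kubernetes case, one possible way to wire this up is Prometheus file-based service discovery, where each cluster's nodes are listed as one target group carrying a `cluster` label. The sketch below is only an illustration: the cluster names and hostnames are hypothetical, and port 9835 is assumed to be the address the exporter listens on (adjust it if your install differs). The printed JSON would be saved to a file that a `file_sd_configs` entry in the Prometheus scrape config points at.

```python
import json

# Assumption: the exporter listens on port 9835 on every node.
EXPORTER_PORT = 9835

# Hypothetical inventory: cluster name -> hostnames of its GPU servers.
CLUSTERS = {
    "cluster-a": ["gpu-node01", "gpu-node02"],
    "cluster-b": ["gpu-node03"],
}


def build_file_sd_targets(clusters, port=EXPORTER_PORT):
    """Return Prometheus file_sd target groups, one group per cluster.

    Every target in a group gets the same `cluster` label, so metrics
    can later be grouped per cluster as well as per server.
    """
    groups = []
    for cluster, hosts in sorted(clusters.items()):
        groups.append({
            "targets": [f"{host}:{port}" for host in hosts],
            "labels": {"cluster": cluster},
        })
    return groups


if __name__ == "__main__":
    # Write this output to the file referenced by file_sd_configs.
    print(json.dumps(build_file_sd_targets(CLUSTERS), indent=2))
```

With a label like this in place, per-server values come from the `instance` label that Prometheus attaches to every scraped target, and combined per-cluster values could be computed in PromQL with an aggregation such as `sum by (cluster) (...)` over whichever GPU metric is of interest.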

@myrelle22
Author

These are not Kubernetes clusters. They are something similar to Beowulf clusters: multiple servers with GPUs, a common file system, and a head node for submitting jobs with a tool like the Slurm scheduler.
These types of clusters run generative AI and other applications requiring large amounts of computing power.
