I have been searching, without success, for a solution that supports high-performance computing clusters. The setup consists of two or more computing clusters, each with multiple GPUs.
I would need GPU metrics per server and combined per cluster.
There are numerous people looking for a solution for this.
I love this tool for a single server, but it won't work for clusters. I have yet to find a solution for clusters.
Thank you
Andrea
The regular setup of kube-prometheus-stack and this chart should work just fine. You need to set serviceMonitor.enabled=true in the values, and you should be good to go.
Even if we are not talking about Kubernetes clusters, you can simply install the exporter on each machine and configure Prometheus to scrape all of them. The exporter exposes metrics for each GPU on the machine separately, identified by its UUID. What is the issue you are facing when using this tool with multiple GPUs and multiple machines?
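For the non-Kubernetes case, one possible way to wire this up is Prometheus file-based service discovery (`file_sd_configs`), which reads target groups from a JSON file. The sketch below is only an illustration, not part of this project: the cluster names, hostnames, and port are hypothetical placeholders, and it assumes the exporter is already running on every node.

```python
import json

# Hypothetical inventory: cluster name -> GPU node hostnames.
# Replace with the real nodes of each cluster.
CLUSTERS = {
    "cluster-a": ["a-node01", "a-node02"],
    "cluster-b": ["b-node01"],
}

# Assumed exporter port; use whatever port the exporter listens on.
EXPORTER_PORT = 9835


def build_file_sd_targets(clusters, port):
    """Return a Prometheus file_sd target list with one group per cluster.

    Every target in a group gets a `cluster` label, so queries can
    later be grouped by server (`instance`) or by cluster.
    """
    groups = []
    for cluster, hosts in sorted(clusters.items()):
        groups.append({
            "targets": [f"{host}:{port}" for host in hosts],
            "labels": {"cluster": cluster},
        })
    return groups


if __name__ == "__main__":
    # Write this output to a file referenced by file_sd_configs.
    print(json.dumps(build_file_sd_targets(CLUSTERS, EXPORTER_PORT), indent=2))
```

With a `cluster` label attached to every target, per-server values can be read via the `instance` label, and a combined per-cluster value can be computed in PromQL with an aggregation such as `avg by (cluster) (...)` over whichever GPU metric the exporter exposes.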
These are not Kubernetes clusters. They are something similar to Beowulf clusters: multiple servers with GPUs, a common file system, and a head node for submitting jobs with a tool like the Slurm scheduler.
These types of clusters run generative AI and other applications requiring large amounts of computing power.