DCGM exporter doesn't work on the latest version of Bottlerocket AMI #34
Comments
Hey @peter-volkov, |
I appreciate your help, but I do not really care about the version. If you can successfully run DCGM Exporter as part of the amazon-cloudwatch-observability EKS add-on with g5.xlarge on any Bottlerocket image, that will be enough for me. I will then consider the issue to be my own problem and debug it myself. |
I'm experiencing this as well. I use the official bottlerocket-nvidia AMI. The dcgm-exporter enters a CrashLoopBackOff state, with the following logs:
I tried to manually set up the capability in the aws-observability Helm chart values.yaml, but it seems that the parameter is ignored or doesn't exist:
I tried to make dcgm-exporter work in several ways, including the official NVIDIA Helm charts and the Datadog integration. They all fail when dcgm-exporter tries to start up on Kubernetes. |
We are aware of the issue and are looking at potential solutions. Will keep you posted on a path forward. |
I'm facing this issue as well with Bottlerocket AMI GPU nodes. I don't see a direct way to disable DCGM Exporter from the Helm chart. Can we add a conditional variable to disable DCGM Exporter in the Helm chart until this is fixed? As of now, the only way seems to be to delete its CRD resource. |
@dbcelm You can try updating the agent configuration to disable accelerated hardware monitoring, which should disable DCGM Exporter in your cluster.
For more details, please check the doc. Please note that disabling GPU monitoring with |
@movence Even after this change, I don't see GPU metrics widgets on the Container Insights dashboard. |
@dbcelm Are you looking for an option to disable GPU monitoring entirely, or to disable only DCGM Exporter for your cluster? GPU widgets in the Container Insights dashboard are displayed only when there are GPU metrics to generate them from. If you followed the instructions above to disable GPU monitoring, there will be no GPU metrics, since the flag disables BOTH DCGM Exporter AND Neuron Monitor. |
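For readers following along, here is a minimal sketch of what the agent configuration change discussed above might look like, written as a small Python helper that prints the JSON. The key name `accelerated_compute_metrics` and its nesting under `agent.config.logs.metrics_collected.kubernetes` are assumptions recalled from the AWS Container Insights documentation, not something stated in this thread, so please verify them against the doc linked above. The helper name `build_addon_config` is hypothetical.

```python
import json


def build_addon_config(enable_accelerated_metrics: bool) -> str:
    """Build add-on configuration values as a JSON string.

    Assumption: the flag name and nesting below are recalled from the
    AWS docs and are not confirmed anywhere in this thread.
    """
    config = {
        "agent": {
            "config": {
                "logs": {
                    "metrics_collected": {
                        "kubernetes": {
                            # Keep enhanced Container Insights enabled.
                            "enhanced_container_insights": True,
                            # False is expected to turn off BOTH
                            # DCGM Exporter and Neuron Monitor.
                            "accelerated_compute_metrics": enable_accelerated_metrics,
                        }
                    }
                }
            }
        }
    }
    return json.dumps(config, indent=2)


if __name__ == "__main__":
    # Print the configuration with accelerated compute metrics disabled.
    print(build_addon_config(enable_accelerated_metrics=False))
```

The printed JSON could then be supplied as the add-on's configuration values (for example through the `--configuration-values` option of `aws eks update-addon`). As noted above, this turns off all accelerated compute metrics, so the GPU widgets will not appear while it is set to false.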
@movence I have installed DCGM Exporter helm chart separately on this cluster |
@dbcelm thanks for the information. Here are some possible reasons:
|
@movence I have deployed through the add-on and now the DCGM Exporter pods are working. But I still don't see GPU metrics on the dashboard. I suspect it could be because DCGM Exporter is not running in "hostNetwork: true" mode, as I use the Cilium CNI. Is there a way I can configure "hostNetwork: true" when deploying through the add-on to test this? |
I'm not sure whether this is the correct place to report this. Please direct me elsewhere if it is not.
Goal:
I want to have an EKS cluster with working observability, the Bottlerocket AMI, and GPU nodes (g5* instances).
I use this Helm chart by enabling the amazon-cloudwatch-observability EKS add-on for my cluster.
Steps to reproduce:
I guess this is a version incompatibility between DCGM and the NVIDIA driver (installed on the nodes via k8s-device-plugin).
What should I do to make DCGM exporter work?