Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update Grafana User dashboard #5703

Merged
merged 14 commits into from
Oct 18, 2024

Conversation

davidmirror-ops
Copy link
Contributor

@davidmirror-ops davidmirror-ops commented Aug 28, 2024

Closes #5670

Why are the changes needed?

The user dashboard currently published on Grafana marketplace uses outdated metric names and undocumented dependencies.

What changes were proposed in this pull request?

This PR updates metric names and changes some panel graph types from TimeSeries to BarGauge to better render by (quantile) queries.
Also, it deprecates the use of label_execution_id to mitigate high-cardinality scenarios.

How was this patch tested?

Installed on a Grafana instance provided by the kube-prometheus-stack Helm chart, pointing to a flyte-core cluster with ServiceMonitor enabled.

image

Setup process

  1. Installed the kube-prometheus-stack using the Helm chart making sure the following block is added to the values file:
kube-state-metrics:
  metricLabelsAllowlist:
    - pods=[node-id,workflow-name,task-name]

Without it, those 3 labels won't be available and they are used for filtering on multiple metrics.
2. Configured flyte-core with the following sections as part of the values file:

flyteadmin:
  serviceMonitor:
    enabled: true
    labels:
      release: kube-prometheus-stack #this is particular to the kube-prometheus-stack
  selectorLabels:
    - app.kubernetes.io/instance: flyte-coretf #this is typically the name of the Helm release
    - app.kubernetes.io/name: flyteadmin
flytepropeller:
  service:
    enabled: true
  serviceMonitor:
    enabled: true
    labels:
      release: kube-prometheus-stack
  selectorLabels:
    - app.kubernetes.io/instance: flyte-coretf #Helm release name
    - app.kubernetes.io/name: flytepropeller 
  1. Import the contents from flyteuser-dashboard.json into Grafana

Check all the applicable boxes

  • I updated the documentation accordingly.
  • All new and existing tests passed.
  • All commits are signed-off.

Related PRs

Docs link

Once this is approved, I can proceed to upload the updated dashboard to Grafana Marketplace, updating the docs (including the link to the marketplace content)

@davidmirror-ops davidmirror-ops changed the title Fix Accepted wf metric and succesful wf Update Grafana User dashboard Aug 28, 2024
Copy link

codecov bot commented Aug 28, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 36.31%. Comparing base (30d3314) to head (81ef23c).
Report is 260 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5703      +/-   ##
==========================================
+ Coverage   35.90%   36.31%   +0.40%     
==========================================
  Files        1301     1305       +4     
  Lines      109419   110019     +600     
==========================================
+ Hits        39287    39949     +662     
+ Misses      66035    65914     -121     
- Partials     4097     4156      +59     
Flag Coverage Δ
unittests-datacatalog 51.37% <ø> (ø)
unittests-flyteadmin 55.62% <ø> (+1.88%) ⬆️
unittests-flytecopilot 12.17% <ø> (ø)
unittests-flytectl 62.21% <ø> (-0.07%) ⬇️
unittests-flyteidl 7.12% <ø> (+0.03%) ⬆️
unittests-flyteplugins 53.35% <ø> (+0.03%) ⬆️
unittests-flytepropeller 41.89% <ø> (+0.13%) ⬆️
unittests-flytestdlib 55.37% <ø> (+0.09%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@charliemoriarty
Copy link

Thanks for working on these updates!

This is already looking a lot better - success metrics are working well, though we still see a few small issues on the imported dashboard:

  • failed workflows aren't shown even though we did observe failures in this time period. I believe the stats for these workflows are being recorded against failure_duration_unlabeled_ms_count instead of failure_duration_ms_count
  • failed workflow execution time is showing in the order of weeks - this might need a unit of milliseconds instead of seconds
  • the three panels in the task stats group all show an error: "many-to-many matching not allowed: matching labels must be unique on one side"
Screenshot 2024-09-13 at 10 38 08

@davidmirror-ops
Copy link
Contributor Author

davidmirror-ops commented Sep 13, 2024

@charliemoriarty thanks so much for reporting.
I'm still exploring the Failed Workflows panel as it uses the flyte:propeller:all:workflow:event_recording:failure_duration_ms_count metric and it works when I test it standalone but for some reason is not displayed correctly.
Also I'll see why are you getting that error on Task stats, it works on my flyte-binary environment:

image

I forgot to mention that with kube-state-metrics v2.0.0 or higher, you have to enable the required labels for this to work. If you're using the Helm chart it's a matter of upgrading with a values file including:

kube-state-metrics:
  metricLabelsAllowlist:
    - pods=[node-id,workflow-name,task-name]

Once I have updates on this, will push them here for you to test.

Thanks!

Signed-off-by: davidmirror-ops <[email protected]>
Signed-off-by: davidmirror-ops <[email protected]>
Signed-off-by: davidmirror-ops <[email protected]>
@davidmirror-ops davidmirror-ops marked this pull request as ready for review September 24, 2024 17:24
@davidmirror-ops
Copy link
Contributor Author

@charliemoriarty I'm curious how this dashboard looks in your environment after recent changes.

@Tom-Newton
Copy link
Contributor

Thanks for updating this. It looks good to me but I haven't looked particularly carefully. The updates I made were mostly focused on the admin and propeller dashboards so don't worry about stepping on my changes.

@davidmirror-ops davidmirror-ops enabled auto-merge (squash) October 15, 2024 10:15
@davidmirror-ops davidmirror-ops merged commit a10905a into flyteorg:master Oct 18, 2024
54 checks passed
@davidmirror-ops davidmirror-ops mentioned this pull request Oct 23, 2024
3 tasks
@charliemoriarty
Copy link

@charliemoriarty I'm curious how this dashboard looks in your environment after recent changes.

@davidmirror-ops Thanks again for the work on this! Apologies for the very, very late reply - somehow I missed the notification for the thread at the time 🤦 I can confirm that these changes look a lot better, and that failure visualisation is working as expected. The updates to the documentation are really helpful as well - I think we are missing the metricLabelsAllowList which is why the task visualisations weren't working as expected. Hopefully this helps others get set up out of the box!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] Flyte user dashboard metric name mismatches
3 participants