Fix: Make distributed error aggregation opt-in #6103

Merged Dec 13, 2024 (1 commit)

@fg91 (Member) commented Dec 11, 2024

Why are the changes needed?

As part of RFC #5598, flytepropeller gained the ability to list error files in the raw output prefix bucket of an execution, in order to identify which worker pod of a failed distributed task hit the first error.

In GCP, listing the error files requires the "storage.objects.list" permission, which propeller had not been granted so far. I added this permission to the Flyte propeller custom role here.

Because the feature depends on this additional permission, it is not backwards compatible, so I propose making it opt-in.

If you agree with this, I'll make another PR to document this feature and how to activate it here and/or here.

What changes were proposed in this pull request?

Only search for the multiple error files written by the different workers of a distributed task, as proposed in RFC #5598, if this is explicitly enabled in the flytepropeller config. This way, adding the "storage.objects.list" permission is not strictly required.
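
To illustrate the gating described above, here is a minimal Python sketch rather than the actual flytepropeller (Go) code. All names in it (`ErrorFile`, `find_first_error`, the `enabled` flag, the file paths) are hypothetical and chosen for illustration only; the real config key and file layout are not stated in this description. The point it shows is that the listing call, which is what needs "storage.objects.list", is only made when the feature is explicitly enabled.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class ErrorFile:
    """Hypothetical stand-in for one worker's error file."""
    path: str
    timestamp: float  # when the worker recorded its error
    message: str


def find_first_error(
    enabled: bool,
    read_single: Callable[[], Optional[ErrorFile]],
    list_all: Callable[[], List[ErrorFile]],
) -> Optional[ErrorFile]:
    """Return the error to report for a failed task.

    When the opt-in flag is off, only the single well-known error file is
    read, so no object listing (and no list permission) is needed.
    When it is on, all worker error files are listed and the earliest
    one is returned.
    """
    if not enabled:
        return read_single()

    candidates = list_all()  # this is the call that requires list permission
    if not candidates:
        return None
    return min(candidates, key=lambda e: e.timestamp)
```

With the flag disabled, `list_all` is never invoked, so deployments that have not granted the extra permission behave as before.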

How was this patch tested?

Ran flytepropeller locally against a GKE-based deployment with and without the flag enabled, and adapted the unit tests.

Check all the applicable boxes

  • I updated the documentation accordingly.
  • All new and existing tests passed.
  • All commits are signed-off.

codecov bot commented Dec 11, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 36.99%. Comparing base (4a7f4c2) to head (c6e73c7).
Report is 5 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6103      +/-   ##
==========================================
- Coverage   37.10%   36.99%   -0.11%     
==========================================
  Files        1318     1318              
  Lines      132403   132415      +12     
==========================================
- Hits        49122    48989     -133     
- Misses      79008    79173     +165     
+ Partials     4273     4253      -20     
Flag                      Coverage          Δ
unittests-datacatalog     51.58% <ø>        (ø)
unittests-flyteadmin      54.10% <ø>        (ø)
unittests-flytecopilot    30.99% <ø>        (ø)
unittests-flytectl        62.29% <ø>        (-0.05%) ⬇️
unittests-flyteidl         7.23% <ø>        (ø)
unittests-flyteplugins    53.85% <100.00%>  (+0.02%) ⬆️
unittests-flytepropeller  42.60% <ø>        (ø)
unittests-flytestdlib     55.18% <ø>        (-2.35%) ⬇️


@fg91 fg91 self-assigned this Dec 11, 2024
@eapolinario (Contributor) left a comment

Thank you. This makes sense; please tag me on the docs PR as well.

@fg91 fg91 merged commit bd12812 into master Dec 13, 2024
50 of 52 checks passed
@fg91 fg91 deleted the fg91/fix/opt-in-dist-error-aggregation branch December 13, 2024 07:57