Support parallel for operations, like data parallel training, model parallel training etc #3102

Open: typhoonzero wants to merge 5 commits into main


@typhoonzero (Contributor) commented Feb 7, 2023

What changes were proposed in this pull request?

Support ParallelFor pipeline features for each operation. Set parallel_count > 2 to run an operation in parallel, for example for distributed training or distributed data processing. Features and limitations:

[Screenshot 2023-02-07 10 32 57]

  • Currently supports only KFP with Argo.
  • Automatically sets TF_CONFIG for TensorFlow, and MASTER_ADDR and MASTER_PORT for PyTorch. In some cases, workers with rank >= 1 should wait for rank 0 to start; the user can achieve this by waiting for rank 0's TCP server port to become reachable.
  • No support for Parameter Server style distributed training, since it is less popular now.
  • Components/operations before and after a ParallelFor operation are supported.
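
The "wait for rank 0" step mentioned in the list above is left to the user. A minimal sketch of one way to do it follows, assuming the worker can reach rank 0 at the host and port given by MASTER_ADDR and MASTER_PORT. The helper name `wait_for_port`, the `RANK` environment variable, and the default values are illustrative assumptions, not part of this PR.

```python
import os
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 300.0, interval: float = 2.0) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout.

    Hypothetical helper: this PR does not ship it, it only illustrates the idea.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # Open and immediately close a connection as a reachability probe.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused or timed out: rank 0 is not listening yet.
            time.sleep(interval)
    return False


if __name__ == "__main__":
    # MASTER_ADDR and MASTER_PORT are the variables named in the PR description.
    # RANK and the fallback values below are assumptions made for this sketch.
    host = os.environ.get("MASTER_ADDR", "localhost")
    port = int(os.environ.get("MASTER_PORT", "29500"))
    rank = int(os.environ.get("RANK", "0"))

    if rank >= 1 and not wait_for_port(host, port):
        raise SystemExit(f"rank 0 at {host}:{port} did not become reachable")
    # ...start the framework-specific training code here...
```

Calling this at the top of the training script for ranks >= 1 delays startup until rank 0 is listening. Whether a plain connect-and-close probe is appropriate for a particular framework's rendezvous port should be checked against that framework's documentation.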

[Screenshot 2023-02-07 10 32 27]

How was this pull request tested?

Unit tests are included in test_bootstrapper.py to ensure that the parallel_count argument works.

TODO:

  • Support Airflow and KFP Tekton
  • Support output files for parallel operations (fetch output only from rank 0)
  • Add examples of distributed training with TensorFlow, PyTorch, and Accelerate

@akchinSTC added the labels component:pipeline-editor, component:pipeline-runtime, and status:Work in Progress on Feb 7, 2023
@typhoonzero typhoonzero force-pushed the support_parallel_for_operations branch from b14c41c to ef1b41f Compare May 5, 2023 10:49
@typhoonzero typhoonzero force-pushed the support_parallel_for_operations branch from 7d371cd to 5cb02ab Compare May 9, 2023 11:01
Signed-off-by: typhoonzero <[email protected]>
@typhoonzero typhoonzero force-pushed the support_parallel_for_operations branch from c6653bb to cb47ff4 Compare May 10, 2023 02:17
Signed-off-by: typhoonzero <[email protected]>
@typhoonzero typhoonzero force-pushed the support_parallel_for_operations branch from cb47ff4 to ee6eb87 Compare May 10, 2023 09:22
@typhoonzero typhoonzero changed the title [WIP]: Support parallel for operations Support parallel for operations, like data parallel training, model parallel training etc May 10, 2023
@typhoonzero
Contributor Author

@akchinSTC Can you please check out this feature?

@lresende
Member

> @akchinSTC Can you please check out this feature?

What is the current status of this PR? I still see the work in progress tag.

@typhoonzero
Contributor Author

> > @akchinSTC Can you please check out this feature?
>
> What is the current status of this PR? I still see the work in progress tag.

This PR is ready for review now.
I removed "WIP" from the title, but the tag still seems to be there.

Work on the TODO items will continue after this feature is merged.
