
[HPU] [Train] Add a Stable Diffusion fine-tuning and serving example #45217

Open · wants to merge 7 commits into master
Conversation

@kira-lin (Contributor) commented May 9, 2024

Why are these changes needed?

This PR adds an example of fine-tuning and serving a Stable Diffusion model on HPU. It also covers how to adapt an existing HPU example to run on Ray, so that users can run the examples from huggingface/optimum-habana with Ray.
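For reference, a rough sketch of the serving half is below. Class name, model directory, and resource settings are illustrative assumptions (the actual notebook likely uses optimum-habana's Gaudi pipeline); see the notebook in this PR for the authoritative code.

```python
# Sketch only: serve a fine-tuned Stable Diffusion pipeline with Ray Serve.
# Paths, class names, and resource settings are illustrative.
import io

from diffusers import DiffusionPipeline
from ray import serve
from starlette.requests import Request
from starlette.responses import Response


@serve.deployment(ray_actor_options={"resources": {"HPU": 1}})
class StableDiffusionDeployment:
    def __init__(self, model_dir: str = "/tmp/textual_inversion_cat"):
        # Load the pipeline saved by the fine-tuning script.
        self.pipe = DiffusionPipeline.from_pretrained(model_dir)

    async def __call__(self, request: Request) -> Response:
        prompt = (await request.json())["prompt"]
        image = self.pipe(prompt).images[0]
        buf = io.BytesIO()
        image.save(buf, format="PNG")
        return Response(content=buf.getvalue(), media_type="image/png")


serve.run(StableDiffusionDeployment.bind())
```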

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@aslonnie (Collaborator) left a comment

(for train/tune team to review)

@anyscalesam anyscalesam added triage Needs triage (eg: priority, bug/not-bug, and owning component) train Ray Train Related Issue labels May 9, 2024
@woshiyyya woshiyyya self-assigned this May 9, 2024
@woshiyyya woshiyyya removed the triage Needs triage (eg: priority, bug/not-bug, and owning component) label May 9, 2024
" main()\n",
"```\n",
"\n",
"Originally, this script will be started by MPI if multiple workers are used. But with Ray, we should setup TorchTrainer and supply a main function, which is `main()` in this example.\n",
Member

I love these step-by-step explanations! They are crystal clear :)

doc/source/train/examples/intel_gaudi/sd.ipynb (outdated, resolved)
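A minimal sketch of the adaptation described in that cell, assuming `main()` is the entry point of the optimum-habana training script and that each worker reserves one HPU; the worker count and backend settings here are illustrative, not the notebook's actual values.

```python
# Sketch only: wrap the script's main() in a Ray TorchTrainer instead of
# launching one process per worker with MPI.
from ray.train import ScalingConfig
from ray.train.torch import TorchConfig, TorchTrainer


def train_func(config):
    # Each Ray worker runs the original training entry point.
    # `main` is assumed to be the function from the optimum-habana
    # textual_inversion.py script, defined or imported in the notebook.
    main()


trainer = TorchTrainer(
    train_func,
    # hccl is the collective backend used on Gaudi/HPU.
    torch_config=TorchConfig(backend="hccl"),
    scaling_config=ScalingConfig(
        num_workers=8,                    # illustrative worker count
        resources_per_worker={"HPU": 1},  # one HPU per worker
        use_gpu=False,
    ),
)
result = trainer.fit()
print(result)
```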
@woshiyyya (Member) commented May 10, 2024

Let's also add this example to doc/source/train/examples.yml. You can refer to this PR: https://github.com/ray-project/ray/pull/44667/files.

Also, could you include the required libraries and their versions, so that users can reproduce the example on their own?

@woshiyyya (Member) left a comment

Thanks for the contribution @kira-lin! Generally looks good to me. Left some comments.

@woshiyyya (Member) left a comment

LGTM. Please apply the suggested changes to pass the doc build.

doc/source/train/examples.yml (resolved)
@woshiyyya (Member)

[Screenshot 2024-05-16 at 1:08 PM]

There seems to be a syntax error in the notebook. Can you fix that?

@peytondmurray (Contributor) left a comment

A couple of things:

  1. You start sd.ipynb with (hpu_sd_finetune)=. I'm not sure what is intended here, but this tells Sphinx to create a new anchor at the top of the page. IMO this is clutter, because Sphinx already lets you reference the file itself automatically: instead of referencing hpu_sd_finetune, just make a reference to train/examples/intel_gaudi/sd and Sphinx will create a link to the file. Users who click on the link will be taken to the documentation built from this file: docs.ray.io/en/master/train/examples/intel_gaudi/sd.html; your current approach will create a link to docs.ray.io/en/master/train/examples/intel_gaudi/sd.html#hpu_sd_finetune, with the anchor being at the top of the page. Please see the Sphinx docs for more information about references.

  2. Any reason this needs to be invoked in the shell (i.e., via !<command>)?

!python ~/optimum-habana/examples/stable-diffusion/training/textual_inversion.py \
  --pretrained_model_name_or_path runwayml/stable-diffusion-v1-5 \
  --train_data_dir "/root/cat" \
  --learnable_property object \
  --placeholder_token "<cat-toy>" \
  --initializer_token toy \
  --resolution 512 \
  --train_batch_size 4 \
  --max_train_steps 3000 \
  --learning_rate 5.0e-04 \
  --scale_lr \
  --lr_scheduler constant \
  --lr_warmup_steps 0 \
  --output_dir /tmp/textual_inversion_cat \
  --save_as_full_pipeline \
  --gaudi_config_name Habana/stable-diffusion \
  --throughput_warmup_steps 3

!python ~/... requires the user to be running something bash-like (to have ~ expansion; not sure whether this works on Windows). You might want to use a different cell magic here, which might solve the lexing issue as well (see the sketch at the end of this comment).

  3. The other option that might help here is to set the cell metadata; there may be a way to override how this is rendered with myst_nb.

  4. Another option would be to put the options on one line, although that would ruin the nice spacing you have here.

  5. The cell output is huge, 3000 lines of terminal progress bars. We should probably not have this appear in our documentation. You can remove cell output by setting cell metadata; see https://myst-nb.readthedocs.io/en/latest/render/format_code_cells.html.

Alternatively, we probably want to make use of nbstripout-fast as part of pre-commit hooks, but maybe this is a separate issue.
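For (2), one sketch of an alternative (arguments mirror the command quoted above) is to launch the training script from Python rather than a `!` shell cell, which avoids depending on bash-style ~ expansion:

```python
# Sketch: run the training script via subprocess instead of a `!` shell cell.
import os
import subprocess
import sys

# Expand ~ in Python so the notebook does not rely on a bash-like shell.
script = os.path.expanduser(
    "~/optimum-habana/examples/stable-diffusion/training/textual_inversion.py"
)
subprocess.run(
    [
        sys.executable, script,
        "--pretrained_model_name_or_path", "runwayml/stable-diffusion-v1-5",
        "--train_data_dir", "/root/cat",
        "--learnable_property", "object",
        "--placeholder_token", "<cat-toy>",
        "--initializer_token", "toy",
        "--resolution", "512",
        "--train_batch_size", "4",
        "--max_train_steps", "3000",
        "--learning_rate", "5.0e-04",
        "--scale_lr",
        "--lr_scheduler", "constant",
        "--lr_warmup_steps", "0",
        "--output_dir", "/tmp/textual_inversion_cat",
        "--save_as_full_pipeline",
        "--gaudi_config_name", "Habana/stable-diffusion",
        "--throughput_warmup_steps", "3",
    ],
    check=True,
)
```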

@woshiyyya (Member) commented May 17, 2024

Hi @peytondmurray, I'm the one who added the (hpu_sd_finetune)= tag, following our old doc style. Sorry for the confusion; we can remove it if it's no longer applicable.

For (5), @kira-lin, let's remove the outputs and only show the result object in a new cell.

@kira-lin kira-lin requested a review from woshiyyya May 22, 2024 01:36
@woshiyyya woshiyyya added the go Trigger full test run on premerge label May 22, 2024
@woshiyyya (Member) left a comment

Triggered a premerge run!

Also cc @justinvyu to take a look and merge it?

Labels: go Trigger full test run on premerge · train Ray Train Related Issue

7 participants