
Partial Loading PR3: Integrate 1) partial loading, 2) quantized models, 3) model patching #7500

Merged: 31 commits merged into main on Dec 31, 2024

Conversation

@RyanJDick (Collaborator) commented Dec 29, 2024

Summary

This PR is the third in a sequence of PRs working towards support for partial loading of models onto the compute device (for low-VRAM operation). This PR updates the LoRA patching code so that the following features can cooperate fully:

  • Partial loading of weights onto the GPU
  • Quantized layers / weights
  • Model patches (e.g. LoRA)

Note that this PR does not yet enable partial loading. It adds support in the model patching code so that partial loading can be enabled in a future PR.

Technical Design Decisions

The layer patching logic has been integrated into the custom layers (via CustomModuleMixin) rather than kept in a separate set of wrapper layers, as it was previously. This has the following advantages (a minimal sketch of the pattern follows the list):

  • It makes it easier to calculate the modified weights on the fly and then reuse the normal forward() logic.
  • In the future, it makes it possible to pass original parameters that have already been cast to the compute device down to the LoRA calculation without re-casting them (the current implementation doesn't take full advantage of this yet).
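
To illustrate the pattern (this is a minimal sketch, not the actual InvokeAI implementation; names like CustomLinear, add_patch, and _patched_weight are illustrative), a custom layer can compute its patched weight on the fly and then reuse the normal forward() logic:

```python
import torch


class CustomModuleMixin:
    """Sketch of the mixin idea: the patching logic lives on the custom layer
    itself rather than in a separate wrapper module."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # List of (patch_weight, scale) pairs applied to this layer.
        self._patches: list[tuple[torch.Tensor, float]] = []

    def add_patch(self, patch: torch.Tensor, scale: float) -> None:
        self._patches.append((patch, scale))


class CustomLinear(CustomModuleMixin, torch.nn.Linear):
    def _patched_weight(self) -> torch.Tensor:
        # Compute the modified weight on the fly. The base weight may already
        # have been cast to the compute device, so the patch is moved to match
        # it rather than the other way around.
        weight = self.weight
        for patch, scale in self._patches:
            weight = weight + scale * patch.to(device=weight.device, dtype=weight.dtype)
        return weight

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if not self._patches:
            # No patches: fall back to the normal nn.Linear forward() logic.
            return super().forward(x)
        return torch.nn.functional.linear(x, self._patched_weight(), self.bias)


# A patched layer behaves like a plain nn.Linear with an adjusted weight.
layer = CustomLinear(16, 16)
layer.add_patch(torch.zeros(16, 16), scale=1.0)
out = layer(torch.randn(2, 16))
```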

Known Limitations

  1. I haven't fully solved device management for patch types that require the original layer value to calculate the patch. These patch types aren't very common and aren't compatible with some quantized layers, so this is left as future work if there's demand.
  2. There is a small speed regression for models that have CPU bottlenecks. This seems to be caused by slightly slower method resolution on the custom layer sub-classes. The regression does not show up on larger models, like FLUX, that are almost entirely GPU-limited. I think this small regression is tolerable, but if we decide that it's not, the slowdown can easily be reclaimed by optimizing other CPU operations (e.g. if we only sent every 2nd progress image, we'd see a much more significant speedup).

Related Issues / Discussions

QA Instructions

Speed tests:

  • Vanilla SD1 speed regression
    • Before: 3.156s (8.78 it/s)
    • After: 3.54s (8.35 it/s)
  • Vanilla SDXL speed regression
    • Before: 6.23s (4.46 it/s)
    • After: 6.45s (4.31 it/s)
  • Vanilla FLUX speed regression
    • Before: 12.02s (2.27 it/s)
    • After: 11.91s (2.29 it/s)

LoRA tests with default configuration:

  • SD1: A handful of LoRA variants
  • SDXL: A handful of LoRA variants
  • FLUX non-quantized: multiple LoRA variants
  • FLUX BnB-quantized: multiple LoRA variants
  • FLUX GGML-quantized: multiple LoRA variants
  • FLUX non-quantized: FLUX control LoRA
  • FLUX BnB-quantized: FLUX control LoRA
  • FLUX GGML-quantized: FLUX control LoRA

LoRA tests with sidecar patching forced:

  • SD1: A handful of LoRA variants
  • SDXL: A handful of LoRA variants
  • FLUX non-quantized: multiple LoRA variants
  • FLUX BnB-quantized: multiple LoRA variants
  • FLUX GGML-quantized: multiple LoRA variants
  • FLUX non-quantized: FLUX control LoRA
  • FLUX BnB-quantized: FLUX control LoRA
  • FLUX GGML-quantized: FLUX control LoRA

Other:

  • Smoke testing of IP-Adapter, ControlNet

All tests repeated on:

  • cuda
  • cpu (only tested SD1, because larger models are prohibitively slow)
  • mps (skipped FLUX tests, because my Mac doesn't have enough memory to run them in a reasonable amount of time)

Merge Plan

No special instructions.

Checklist

  • The PR has a short but descriptive title, suitable for a changelog
  • Tests added / updated (if applicable)
  • Documentation added / updated (if applicable)
  • Updated What's New copy (if doing a release after this PR)

@github-actions bot added the python, invocations, backend, and python-tests labels on Dec 29, 2024
@RyanJDick RyanJDick merged commit b46d7ab into main Dec 31, 2024
29 checks passed
@RyanJDick RyanJDick deleted the ryan/model-offload-3-smart-lora-patcher-v2 branch December 31, 2024 18:58
RyanJDick added a commit that referenced this pull request Jan 7, 2025
…#7522)

## Summary

This is an unplanned fix between PR3 and PR4 in the sequence of partial
loading (i.e. low-VRAM) PRs. This PR restores the 'Current Workaround'
documented in #7513. In
other words, to work around a flaw in the model cache API, this fix
allows models to be loaded into VRAM _even if_ they have been dropped
from the RAM cache.

This PR also adds an info log each time that this workaround is hit. In
a future PR (#7509), we will eliminate the places in the application
code that are capable of triggering this condition.
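
A hypothetical sketch of the workaround pattern described above (the cache method names `get`, `reload`, and `lock` are placeholders, not the actual model cache API):

```python
import logging

logger = logging.getLogger(__name__)


def lock_model_in_vram(cache, model_key: str):
    """Load a model onto the GPU, tolerating the case where it has already
    been dropped from the RAM cache (the workaround restored by this PR)."""
    try:
        # Normal path: the model is still present in the RAM cache.
        entry = cache.get(model_key)
    except KeyError:
        # Workaround path: the model was dropped from the RAM cache before it
        # could be locked in VRAM. Log at info level and reload it.
        logger.info(
            "Model %s was dropped from the RAM cache before it could be moved "
            "to VRAM; reloading it (see #7513).",
            model_key,
        )
        entry = cache.reload(model_key)  # placeholder for a re-load path
    return cache.lock(entry)  # move/lock the model onto the compute device
```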

## Related Issues / Discussions

- #7492 
- #7494
- #7500 
- #7513

## QA Instructions

- Set RAM cache limit to a small value. E.g. `ram: 4`
- Run FLUX text-to-image with the full T5 encoder, which exceeds 4GB.
This will trigger the error condition.
- Before the fix, this test configuration would cause a `KeyError`. After
the fix, we should see an info-level log explaining that the condition was
hit, and generation should continue successfully.

## Merge Plan

No special instructions.

## Checklist

- [x] _The PR has a short but descriptive title, suitable for a
changelog_
- [x] _Tests added / updated (if applicable)_
- [x] _Documentation added / updated (if applicable)_
- [ ] _Updated `What's New` copy (if doing a release after this PR)_
RyanJDick added a commit that referenced this pull request Jan 7, 2025
## Summary

This PR adds support for partial loading of models onto the GPU. This
enables models to run with much lower peak VRAM requirements (e.g. full
FLUX dev with 8GB of VRAM).

The partial loading feature is enabled behind a new config flag:
`enable_partial_loading=True`. This flag defaults to `False`.

**Note about performance:**
The `ram` and `vram` config limits are still applied when
`enable_partial_loading=True` is set. This can result in significant
slowdowns compared to the 'old' behaviour. Consider the case where the
VRAM limit is set to `vram=0.75` (GB) and we are trying to run an 8GB
model. When `enable_partial_loading=False`, we attempt to load the
entire model into VRAM, and if it fits (no OOM error) then it will run
at full speed. When `enable_partial_loading=True`, since we have the
option to partially load the model we will only load 0.75 GB into VRAM
and leave the remaining 7.25 GB in RAM. This will cause inference to be
much slower than before. To work around this, it is important that your
`ram` and `vram` configs are carefully tuned. In a future PR, we will
add the ability to dynamically set the RAM/VRAM limits based on the
available memory / VRAM.
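
As a concrete illustration of the tuning note above, an `invokeai.yaml` might look something like the following (a hedged sketch with illustrative values, not recommendations; other keys in the file are omitted):

```yaml
# invokeai.yaml (illustrative values only)
enable_partial_loading: true
ram: 12   # RAM cache limit, in GB
vram: 20  # VRAM cache limit, in GB; tune this close to your GPU's capacity
```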

## Related Issues / Discussions

- #7492 
- #7494 
- #7500

## QA Instructions

Tests with `enable_partial_loading=True`, `vram=2`, on CUDA device:
For all tests, we expect model memory to stay below 2 GB. Peak working
memory will be higher.
- [x] SD1 inference
- [x] SDXL inference
- [x] FLUX non-quantized inference
- [x] FLUX GGML-quantized inference
- [x] FLUX BnB quantized inference
- [x] Variety of ControlNet / IP-Adapter / LoRA smoke tests

Tests with `enable_partial_loading=True`, and hack to force all models
to load 10%, on CUDA device:
- [x] SD1 inference
- [x] SDXL inference
- [x] FLUX non-quantized inference
- [x] FLUX GGML-quantized inference
- [x] FLUX BnB quantized inference
- [x] Variety of ControlNet / IP-Adapter / LoRA smoke tests

Tests with `enable_partial_loading=False`, `vram=30`:
We expect no change in behavior when `enable_partial_loading=False`.
- [x] SD1 inference
- [x] SDXL inference
- [x] FLUX non-quantized inference
- [x] FLUX GGML-quantized inference
- [x] FLUX BnB quantized inference
- [x] Variety of ControlNet / IP-Adapter / LoRA smoke tests

Other platforms:
- [x] No change in behavior on MPS, even if
`enable_partial_loading=True`.
- [x] No change in behavior on CPU-only systems, even if
`enable_partial_loading=True`.

## Merge Plan

- [x] Merge #7500 first, and change the target branch to main

## Checklist

- [x] _The PR has a short but descriptive title, suitable for a
changelog_
- [x] _Tests added / updated (if applicable)_
- [x] _Documentation added / updated (if applicable)_
- [ ] _Updated `What's New` copy (if doing a release after this PR)_
RyanJDick added a commit that referenced this pull request Jan 7, 2025
## Summary

This PR enables RAM/VRAM cache size limits to be determined dynamically
based on availability.

**Config Changes**

This PR modifies the app configs in the following ways:
- A new `device_working_mem_gb` config was added. This is the amount of
non-model working memory to keep available on the execution device (i.e.
GPU) when using dynamic cache limits. It defaults to 3 GB.
- The `ram` and `vram` configs now default to `None`. If these configs
are set, they will take precedence over the dynamic limits. **Note: Some
users may have previously overridden the `ram` and `vram` values in their
`invokeai.yaml`. They will need to remove these configs to enable the
new dynamic limit feature.**
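
A hedged sketch of the two configuration modes described above (values are illustrative; other `invokeai.yaml` keys are omitted):

```yaml
# Dynamic limits (new default): leave `ram` / `vram` unset.
device_working_mem_gb: 3   # non-model working memory reserved on the GPU

# Explicit limits (old behavior): setting these overrides the dynamic limits.
# ram: 12
# vram: 20
```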

**Working Memory**

In addition to the new `device_working_mem_gb` config described above,
memory-intensive operations can estimate the amount of working memory
that they will need and request it from the model cache. This is
currently applied to the VAE decoding step for all models. In the
future, we may apply this to other operations as we work out which ops
tend to exceed the default working memory reservation.
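
As a rough illustration of the kind of estimate involved (the function below is hypothetical, not this PR's actual implementation), the working memory for a VAE decode can be approximated from the size of the decoded image:

```python
import torch


def estimate_vae_decode_working_memory(
    latents: torch.Tensor,
    out_channels: int = 3,
    scale_factor: int = 8,
    overhead: float = 2.5,
) -> int:
    """Return an estimated number of bytes of working memory needed to decode
    `latents` into an image. `overhead` is a fudge factor covering the
    intermediate activations inside the VAE decoder."""
    batch, _, height, width = latents.shape
    out_elems = batch * out_channels * (height * scale_factor) * (width * scale_factor)
    return int(out_elems * latents.element_size() * overhead)


# Hypothetical usage: the estimate is passed to the model cache when the VAE
# is loaded, so the cache can free enough VRAM before decoding starts.
# vae = model_cache.load(vae_key, working_mem_bytes=estimate_vae_decode_working_memory(latents))
```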

**Mitigations for #7513**

This PR includes some mitigations for the issue described in
#7513. Without these
mitigations, it would occur with higher frequency when dynamic RAM
limits are used and the RAM is close to maxed-out.

## Limitations / Future Work

- Only _models_ can be offloaded to RAM to conserve VRAM. I.e. if VAE
decoding requires more working VRAM than available, the best we can do
is keep the full model on the CPU, but we will still hit an OOM error.
In the future, we could detect this ahead of time and switch to running
inference on the CPU for those ops.
- There is often a non-negligible amount of VRAM 'reserved' by the torch
CUDA allocator, but not used by any allocated tensors. We may be able to
tune the torch CUDA allocator to work better for our use case (a minimal
example of this kind of tuning follows this list). Reference:
https://pytorch.org/docs/stable/notes/cuda.html#optimizing-memory-usage-with-pytorch-cuda-alloc-conf
- There may be some ops that require high working memory that haven't
been updated to request extra memory yet. We will update these as we
uncover them.
- If a model is 'locked' in VRAM, it won't be partially unloaded if a
later model load requests extra working memory. This should be uncommon,
but I can think of cases where it would matter.
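
As a minimal example of the allocator tuning mentioned above (the specific setting is illustrative, not something this PR enables), `PYTORCH_CUDA_ALLOC_CONF` needs to be set before the first CUDA allocation:

```python
import os

# Must be set before torch makes its first CUDA allocation, so set it before
# importing torch (or before any CUDA tensor is created).
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch

if torch.cuda.is_available():
    # Reserved-but-unallocated memory is the gap this kind of tuning aims to shrink.
    print(torch.cuda.memory_reserved(), torch.cuda.memory_allocated())
```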

## Related Issues / Discussions

- #7492 
- #7494 
- #7500 
- #7505 

## QA Instructions

Run a variety of models near the cache limits to ensure that model
switching works properly for the following configurations:
- [x] CUDA, `enable_partial_loading=true`, all other configs default
(i.e. dynamic memory limits)
- [x] CUDA, `enable_partial_loading=true`, CPU and CUDA memory reserved
in another process so there is limited RAM/VRAM remaining, all other
configs default (i.e. dynamic memory limits)
- [x] CUDA, `enable_partial_loading=false`, all other configs default
(i.e. dynamic memory limits)
- [x] CUDA, ram/vram limits set (these should take precedence over the
dynamic limits)
- [x] MPS, all other configs default (i.e. dynamic memory limits)
- [x] CPU, all other configs default (i.e. dynamic memory limits)

## Merge Plan

- [x] Merge #7505 first and change target branch to main

## Checklist

- [x] _The PR has a short but descriptive title, suitable for a
changelog_
- [x] _Tests added / updated (if applicable)_
- [x] _Documentation added / updated (if applicable)_
- [ ] _Updated `What's New` copy (if doing a release after this PR)_