
Partial Loading PR3: Integrate 1) partial loading, 2) quantized models, 3) model patching #7500

Merged: 31 commits merged into main on Dec 31, 2024

Conversation

@RyanJDick (Collaborator) commented Dec 29, 2024

Summary

This PR is the third in a sequence of PRs working towards support for partial loading of models onto the compute device (for low-VRAM operation). This PR updates the LoRA patching code so that the following features can cooperate fully:

  • Partial loading of weights onto the GPU
  • Quantized layers / weights
  • Model patches (e.g. LoRA)

Note that this PR does not yet enable partial loading. It adds support in the model patching code so that partial loading can be enabled in a future PR.

Technical Design Decisions

The layer patching logic has been integrated into the custom layers (via CustomModuleMixin) rather than kept in a separate set of wrapper layers, as it was previously. This has the following advantages (a minimal sketch of the pattern follows the list):

  • It makes it easier to calculate the modified weights on the fly and then reuse the normal forward() logic.
  • In the future, it makes it possible to pass original parameters that have already been cast to the compute device down to the LoRA calculation without re-casting them (the current implementation doesn't take full advantage of this yet).
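
To illustrate the pattern (this is a minimal sketch, not the actual InvokeAI implementation; names like CustomLinear, add_patch, and _patched_weight are illustrative), a custom layer can compute its patched weight on the fly and then reuse the normal forward() logic:

```python
import torch


class CustomModuleMixin:
    """Sketch of the mixin idea: the patching logic lives on the custom layer
    itself rather than in a separate wrapper module."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # List of (patch_weight, scale) pairs applied to this layer.
        self._patches: list[tuple[torch.Tensor, float]] = []

    def add_patch(self, patch: torch.Tensor, scale: float) -> None:
        self._patches.append((patch, scale))


class CustomLinear(CustomModuleMixin, torch.nn.Linear):
    def _patched_weight(self) -> torch.Tensor:
        # Compute the modified weight on the fly. The base weight may already
        # have been cast to the compute device, so the patch is moved to match
        # it rather than the other way around.
        weight = self.weight
        for patch, scale in self._patches:
            weight = weight + scale * patch.to(device=weight.device, dtype=weight.dtype)
        return weight

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if not self._patches:
            # No patches: fall back to the normal nn.Linear forward() logic.
            return super().forward(x)
        return torch.nn.functional.linear(x, self._patched_weight(), self.bias)


# A patched layer behaves like a plain nn.Linear with an adjusted weight.
layer = CustomLinear(16, 16)
layer.add_patch(torch.zeros(16, 16), scale=1.0)
out = layer(torch.randn(2, 16))
```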

Known Limitations

  1. I haven't fully solved device management for patch types that require the original layer value to calculate the patch. These patch types aren't very common and aren't compatible with some quantized layers, so this is left as future work if there's demand.
  2. There is a small speed regression for models that have CPU bottlenecks. This seems to be caused by slightly slower method resolution on the custom layer sub-classes. The regression does not show up on larger models, like FLUX, that are almost entirely GPU-limited. I think this small regression is tolerable, but if we decide that it's not, the slowdown can easily be reclaimed by optimizing other CPU operations (e.g. if we only sent every 2nd progress image, we'd see a much more significant speedup).

Related Issues / Discussions

QA Instructions

Speed tests:

  • Vanilla SD1 speed regression
    • Before: 3.156s (8.78 it/s)
    • After: 3.54s (8.35 it/s)
  • Vanilla SDXL speed regression
    • Before: 6.23s (4.46 it/s)
    • After: 6.45s (4.31 it/s)
  • Vanilla FLUX speed regression
    • Before: 12.02s (2.27 it/s)
    • After: 11.91s (2.29 it/s)

LoRA tests with default configuration:

  • SD1: A handful of LoRA variants
  • SDXL: A handful of LoRA variants
  • FLUX non-quantized: multiple LoRA variants
  • FLUX BnB-quantized: multiple LoRA variants
  • FLUX GGML-quantized: multiple LoRA variants
  • FLUX non-quantized: FLUX control LoRA
  • FLUX BnB-quantized: FLUX control LoRA
  • FLUX GGML-quantized: FLUX control LoRA

LoRA tests with sidecar patching forced:

  • SD1: A handful of LoRA variants
  • SDXL: A handful of LoRA variants
  • FLUX non-quantized: multiple LoRA variants
  • FLUX BnB-quantized: multiple LoRA variants
  • FLUX GGML-quantized: multiple LoRA variants
  • FLUX non-quantized: FLUX control LoRA
  • FLUX BnB-quantized: FLUX control LoRA
  • FLUX GGML-quantized: FLUX control LoRA

Other:

  • Smoke testing of IP-Adapter, ControlNet

All tests repeated on:

  • cuda
  • cpu (only tested SD1, because larger models are prohibitively slow)
  • mps (skipped FLUX tests, because my Mac doesn't have enough memory to run them in a reasonable amount of time)

Merge Plan

No special instructions.

Checklist

  • The PR has a short but descriptive title, suitable for a changelog
  • Tests added / updated (if applicable)
  • Documentation added / updated (if applicable)
  • Updated What's New copy (if doing a release after this PR)

@github-actions bot added the python, invocations, backend, and python-tests labels on Dec 29, 2024
@RyanJDick RyanJDick merged commit b46d7ab into main Dec 31, 2024
29 checks passed
@RyanJDick RyanJDick deleted the ryan/model-offload-3-smart-lora-patcher-v2 branch December 31, 2024 18:58
RyanJDick added a commit that referenced this pull request Jan 7, 2025
…#7522)

## Summary

This is an unplanned fix between PR3 and PR4 in the sequence of partial
loading (i.e. low-VRAM) PRs. This PR restores the 'Current Workaround'
documented in #7513. In
other words, to work around a flaw in the model cache API, this fix
allows models to be loaded into VRAM _even if_ they have been dropped
from the RAM cache.

This PR also adds an info log each time that this workaround is hit. In
a future PR (#7509), we will eliminate the places in the application
code that are capable of triggering this condition.
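
A hypothetical sketch of the workaround pattern described above (the cache method names `get`, `reload`, and `lock` are placeholders, not the actual model cache API):

```python
import logging

logger = logging.getLogger(__name__)


def lock_model_in_vram(cache, model_key: str):
    """Load a model onto the GPU, tolerating the case where it has already
    been dropped from the RAM cache (the workaround restored by this PR)."""
    try:
        # Normal path: the model is still present in the RAM cache.
        entry = cache.get(model_key)
    except KeyError:
        # Workaround path: the model was dropped from the RAM cache before it
        # could be locked in VRAM. Log at info level and reload it.
        logger.info(
            "Model %s was dropped from the RAM cache before it could be moved "
            "to VRAM; reloading it (see #7513).",
            model_key,
        )
        entry = cache.reload(model_key)  # placeholder for a re-load path
    return cache.lock(entry)  # move/lock the model onto the compute device
```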

## Related Issues / Discussions

- #7492 
- #7494
- #7500 
- #7513

## QA Instructions

- Set RAM cache limit to a small value. E.g. `ram: 4`
- Run FLUX text-to-image with the full T5 encoder, which exceeds 4GB.
This will trigger the error condition.
- Before the fix, this test configuration would cause a `KeyError`. After
the fix, we should see an info-level log explaining that the condition was
hit, and generation should continue successfully.

## Merge Plan

No special instructions.

## Checklist

- [x] _The PR has a short but descriptive title, suitable for a
changelog_
- [x] _Tests added / updated (if applicable)_
- [x] _Documentation added / updated (if applicable)_
- [ ] _Updated `What's New` copy (if doing a release after this PR)_
RyanJDick added a commit that referenced this pull request Jan 7, 2025
## Summary

This PR adds support for partial loading of models onto the GPU. This
enables models to run with much lower peak VRAM requirements (e.g. full
FLUX dev with 8GB of VRAM).

The partial loading feature is enabled behind a new config flag:
`enable_partial_loading=True`. This flag defaults to `False`.

**Note about performance:**
The `ram` and `vram` config limits are still applied when
`enable_partial_loading=True` is set. This can result in significant
slowdowns compared to the 'old' behaviour. Consider the case where the
VRAM limit is set to `vram=0.75` (GB) and we are trying to run an 8GB
model. When `enable_partial_loading=False`, we attempt to load the
entire model into VRAM, and if it fits (no OOM error) then it will run
at full speed. When `enable_partial_loading=True`, since we have the
option to partially load the model we will only load 0.75 GB into VRAM
and leave the remaining 7.25 GB in RAM. This will cause inference to be
much slower than before. To work around this, it is important that your
`ram` and `vram` configs are carefully tuned. In a future PR, we will
add the ability to dynamically set the RAM/VRAM limits based on the
available memory / VRAM.
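
As a concrete illustration of the tuning note above, an `invokeai.yaml` might look something like the following (a hedged sketch with illustrative values, not recommendations; other keys in the file are omitted):

```yaml
# invokeai.yaml (illustrative values only)
enable_partial_loading: true
ram: 12   # RAM cache limit, in GB
vram: 20  # VRAM cache limit, in GB; tune this close to your GPU's capacity
```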

## Related Issues / Discussions

- #7492 
- #7494 
- #7500

## QA Instructions

Tests with `enable_partial_loading=True`, `vram=2`, on CUDA device:
For all tests, we expect model memory to stay below 2 GB. Peak working
memory will be higher.
- [x] SD1 inference
- [x] SDXL inference
- [x] FLUX non-quantized inference
- [x] FLUX GGML-quantized inference
- [x] FLUX BnB quantized inference
- [x] Variety of ControlNet / IP-Adapter / LoRA smoke tests

Tests with `enable_partial_loading=True`, and hack to force all models
to load 10%, on CUDA device:
- [x] SD1 inference
- [x] SDXL inference
- [x] FLUX non-quantized inference
- [x] FLUX GGML-quantized inference
- [x] FLUX BnB quantized inference
- [x] Variety of ControlNet / IP-Adapter / LoRA smoke tests

Tests with `enable_partial_loading=False`, `vram=30`:
We expect no change in behavior when `enable_partial_loading=False`.
- [x] SD1 inference
- [x] SDXL inference
- [x] FLUX non-quantized inference
- [x] FLUX GGML-quantized inference
- [x] FLUX BnB quantized inference
- [x] Variety of ControlNet / IP-Adapter / LoRA smoke tests

Other platforms:
- [x] No change in behavior on MPS, even if
`enable_partial_loading=True`.
- [x] No change in behavior on CPU-only systems, even if
`enable_partial_loading=True`.

## Merge Plan

- [x] Merge #7500 first, and change the target branch to main

## Checklist

- [x] _The PR has a short but descriptive title, suitable for a
changelog_
- [x] _Tests added / updated (if applicable)_
- [x] _Documentation added / updated (if applicable)_
- [ ] _Updated `What's New` copy (if doing a release after this PR)_
RyanJDick added a commit that referenced this pull request Jan 7, 2025
## Summary

This PR enables RAM/VRAM cache size limits to be determined dynamically
based on availability.

**Config Changes**

This PR modifies the app configs in the following ways:
- A new `device_working_mem_gb` config was added. This is the amount of
non-model working memory to keep available on the execution device (i.e.
GPU) when using dynamic cache limits. It defaults to 3 GB.
- The `ram` and `vram` configs now default to `None`. If these configs
are set, they will take precedence over the dynamic limits. **Note: Some
users may have previously overridden the `ram` and `vram` values in their
`invokeai.yaml`. They will need to remove these configs to enable the
new dynamic limit feature.**
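
A hedged sketch of the two configuration modes described above (values are illustrative; other `invokeai.yaml` keys are omitted):

```yaml
# Dynamic limits (new default): leave `ram` / `vram` unset.
device_working_mem_gb: 3   # non-model working memory reserved on the GPU

# Explicit limits (old behavior): setting these overrides the dynamic limits.
# ram: 12
# vram: 20
```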

**Working Memory**

In addition to the new `device_working_mem_gb` config described above,
memory-intensive operations can estimate the amount of working memory
that they will need and request it from the model cache. This is
currently applied to the VAE decoding step for all models. In the
future, we may apply this to other operations as we work out which ops
tend to exceed the default working memory reservation.
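
As a rough illustration of the kind of estimate involved (the function below is hypothetical, not this PR's actual implementation), the working memory for a VAE decode can be approximated from the size of the decoded image:

```python
import torch


def estimate_vae_decode_working_memory(
    latents: torch.Tensor,
    out_channels: int = 3,
    scale_factor: int = 8,
    overhead: float = 2.5,
) -> int:
    """Return an estimated number of bytes of working memory needed to decode
    `latents` into an image. `overhead` is a fudge factor covering the
    intermediate activations inside the VAE decoder."""
    batch, _, height, width = latents.shape
    out_elems = batch * out_channels * (height * scale_factor) * (width * scale_factor)
    return int(out_elems * latents.element_size() * overhead)


# Hypothetical usage: the estimate is passed to the model cache when the VAE
# is loaded, so the cache can free enough VRAM before decoding starts.
# vae = model_cache.load(vae_key, working_mem_bytes=estimate_vae_decode_working_memory(latents))
```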

**Mitigations for #7513**

This PR includes some mitigations for the issue described in
#7513. Without these
mitigations, it would occur with higher frequency when dynamic RAM
limits are used and the RAM is close to maxed-out.

## Limitations / Future Work

- Only _models_ can be offloaded to RAM to conserve VRAM. I.e. if VAE
decoding requires more working VRAM than available, the best we can do
is keep the full model on the CPU, but we will still hit an OOM error.
In the future, we could detect this ahead of time and switch to running
inference on the CPU for those ops.
- There is often a non-negligible amount of VRAM 'reserved' by the torch
CUDA allocator, but not used by any allocated tensors. We may be able to
tune the torch CUDA allocator to work better for our use case (a minimal
example of this kind of tuning follows this list). Reference:
https://pytorch.org/docs/stable/notes/cuda.html#optimizing-memory-usage-with-pytorch-cuda-alloc-conf
- There may be some ops that require high working memory that haven't
been updated to request extra memory yet. We will update these as we
uncover them.
- If a model is 'locked' in VRAM, it won't be partially unloaded if a
later model load requests extra working memory. This should be uncommon,
but I can think of cases where it would matter.
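
As a minimal example of the allocator tuning mentioned above (the specific setting is illustrative, not something this PR enables), `PYTORCH_CUDA_ALLOC_CONF` needs to be set before the first CUDA allocation:

```python
import os

# Must be set before torch makes its first CUDA allocation, so set it before
# importing torch (or before any CUDA tensor is created).
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch

if torch.cuda.is_available():
    # Reserved-but-unallocated memory is the gap this kind of tuning aims to shrink.
    print(torch.cuda.memory_reserved(), torch.cuda.memory_allocated())
```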

## Related Issues / Discussions

- #7492 
- #7494 
- #7500 
- #7505 

## QA Instructions

Run a variety of models near the cache limits to ensure that model
switching works properly for the following configurations:
- [x] CUDA, `enable_partial_loading=true`, all other configs default
(i.e. dynamic memory limits)
- [x] CUDA, `enable_partial_loading=true`, CPU and CUDA memory reserved
in another process so there is limited RAM/VRAM remaining, all other
configs default (i.e. dynamic memory limits)
- [x] CUDA, `enable_partial_loading=false`, all other configs default
(i.e. dynamic memory limits)
- [x] CUDA, ram/vram limits set (these should take precedence over the
dynamic limits)
- [x] MPS, all other configs default (i.e. dynamic memory limits)
- [x] CPU, all other configs default (i.e. dynamic memory limits)

## Merge Plan

- [x] Merge #7505 first and change target branch to main

## Checklist

- [x] _The PR has a short but descriptive title, suitable for a
changelog_
- [x] _Tests added / updated (if applicable)_
- [x] _Documentation added / updated (if applicable)_
- [ ] _Updated `What's New` copy (if doing a release after this PR)_