
Optimization issues causing slow speed #32

Open
Loko415 opened this issue Mar 18, 2024 · 3 comments

Labels
enhancement New feature or request

Comments


Loko415 commented Mar 18, 2024

Why are you using bf16? From my research it causes slower speed and is usually used for training instead. Also, why are you offloading to the CPU? And why are you freeing CUDA memory each time? That just makes the model reload each time. I may be completely wrong, I don't know much about AI, but I find it strange. Can you explain this to me? How much VRAM does this model use, 12 GB? Maybe it's using too much VRAM. I only have a 3060 with 12 GB, so maybe it's spilling into system RAM, and that's why it's slower than SDXL.

IndigoDosSantos (Owner) commented

Why bfloat16 (bf16)? I'm using bfloat16 because of its smaller memory requirement. Although bfloat16 offers less precision than float32 or half-precision float (fp16), it keeps the wide dynamic range needed to represent very large or very small numbers. In diffusion models like Stable Diffusion and Stable Cascade, this precision trade-off is generally acceptable.
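For reference, a minimal sketch of what loading in bfloat16 looks like with the diffusers Stable Cascade pipelines (the model IDs here are the public Stability ones and may not be exactly what this repo loads):

```python
import torch
from diffusers import StableCascadePriorPipeline, StableCascadeDecoderPipeline

# Loading both stages in bfloat16 roughly halves the weight memory
# compared to float32 while keeping float32's dynamic range.
prior = StableCascadePriorPipeline.from_pretrained(
    "stabilityai/stable-cascade-prior", torch_dtype=torch.bfloat16
)
decoder = StableCascadeDecoderPipeline.from_pretrained(
    "stabilityai/stable-cascade", torch_dtype=torch.bfloat16
)
```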

Why CPU offloading? Offloading models to the CPU helps preserve valuable VRAM, especially for users lacking top-tier GPUs such as the 4090.
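As a sketch, diffusers exposes this as a one-liner per pipeline (it requires the accelerate package to be installed):

```python
# Weights stay in system RAM; each sub-model is moved to the GPU only
# while it is actually running, then moved back off afterwards.
prior.enable_model_cpu_offload()
decoder.enable_model_cpu_offload()
```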

Why clear CUDA memory? Clearing CUDA memory after each process (prior and decoder) helps reduce VRAM consumption. Since Stable Cascade operates in a sequential manner, it's unnecessary to keep both the prior and decoder models in VRAM at the same time. This strategy lessens memory demands, particularly on systems with limited VRAM capacity.
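A rough sketch of that pattern, continuing from the loading example above (variable names are illustrative, not necessarily the ones used in this repo):

```python
import gc
import torch

prompt = "a photo of a cat"

# Stage 1: run the prior, then drop it before running the decoder.
prior_output = prior(prompt=prompt, num_inference_steps=20)
del prior
gc.collect()
torch.cuda.empty_cache()  # hand the freed blocks back to the driver

# Stage 2: the decoder now has (almost) the full VRAM budget to itself.
image = decoder(
    image_embeddings=prior_output.image_embeddings,
    prompt=prompt,
).images[0]
```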

VRAM Usage: My configuration necessitates approximately 11GB of VRAM with models offloaded.
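If you want to check what it needs on your own card, torch can report the peak allocation for a run:

```python
import torch

torch.cuda.reset_peak_memory_stats()
# ... run a full prior + decoder generation here ...
peak_gb = torch.cuda.max_memory_allocated() / 1024**3
print(f"Peak VRAM allocated: {peak_gb:.1f} GB")
```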

Comparison to SDXL: Indeed, Stable Cascade doesn't match SDXL in speed. Due to its more compact size, SDXL can typically remain fully loaded in VRAM, avoiding the loading delays that come with offloading.

Btw, I found a few lines in the loading process that aren't needed and even slow model loading down further. I deleted them, and this is the (admittedly small, but we take what we get, don't we? 😅) impact:

(Image: diagram_issue-32 – loading-time comparison before/after removing those lines)

IndigoDosSantos added a commit that referenced this issue Mar 18, 2024
Found a few lines in the loading process that are not needed, even slow the model loading process further down. See [issue #32](#32) for reference.

Loko415 commented Mar 19, 2024

Very nice ☺️


Loko415 commented Mar 19, 2024

There should be presets like high VRAM / low VRAM etc., in case someone's GPU can take it 👍
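A hypothetical sketch of what such presets could look like (names and the 16 GB threshold are made up for illustration, not part of this repo):

```python
import torch

# Hypothetical VRAM presets -- names and thresholds are illustrative only.
PRESETS = {
    "low_vram":  {"cpu_offload": True,  "clear_cache_between_stages": True},
    "high_vram": {"cpu_offload": False, "clear_cache_between_stages": False},
}

def pick_preset() -> dict:
    """Choose a preset based on total VRAM of the first CUDA device."""
    total_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    return PRESETS["high_vram"] if total_gb >= 16 else PRESETS["low_vram"]
```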

IndigoDosSantos added the enhancement (New feature or request) label Mar 19, 2024