Optimization issues causing slow speed #32
**Why bfloat16 (bf16)?** I'm using bfloat16 for its smaller memory footprint. Although bfloat16 has less precision than float32 or half-precision float (fp16), it keeps float32's full exponent range, so it can still represent very large or very small numbers. In diffusion models like Stable Diffusion and Stable Cascade, this trade-off in precision is generally acceptable.

**Why CPU offloading?** Offloading models to the CPU preserves valuable VRAM, especially for users without top-tier GPUs such as the 4090.

**Why clear CUDA memory?** Clearing CUDA memory after each stage (prior and decoder) reduces VRAM consumption. Since Stable Cascade runs the two stages sequentially, there's no need to keep both the prior and decoder models in VRAM at the same time. This lowers memory demands, particularly on systems with limited VRAM.

**VRAM usage:** My configuration needs roughly 11 GB of VRAM with models offloaded.

**Comparison to SDXL:** Indeed, Stable Cascade doesn't match SDXL in speed. Because SDXL is more compact, it can typically stay fully loaded in VRAM, avoiding the loading delays that come with offloading.

Btw, I found a few lines in the loading process that weren't needed and even slowed model loading down further. I deleted them, and this is the impact, small as it is (but we take what we get, don't we? 😅):
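For reference, the strategy described above looks roughly like this with the diffusers Stable Cascade pipelines. This is a minimal sketch, not this repo's exact loading code: the checkpoint IDs, step counts, and guidance values follow the common Hugging Face examples and are assumptions.

```python
import gc
import torch
from diffusers import StableCascadePriorPipeline, StableCascadeDecoderPipeline

prompt = "an astronaut riding a horse"

# Stage 1: the prior, loaded in bfloat16 with CPU offloading enabled.
prior = StableCascadePriorPipeline.from_pretrained(
    "stabilityai/stable-cascade-prior", torch_dtype=torch.bfloat16
)
prior.enable_model_cpu_offload()  # submodules stay in RAM until needed
prior_output = prior(prompt=prompt, num_inference_steps=20, guidance_scale=4.0)

# Free the prior's VRAM before loading the decoder
# (the "clear CUDA memory after each stage" step).
del prior
gc.collect()
torch.cuda.empty_cache()

# Stage 2: the decoder consumes the prior's image embeddings.
decoder = StableCascadeDecoderPipeline.from_pretrained(
    "stabilityai/stable-cascade", torch_dtype=torch.bfloat16
)
decoder.enable_model_cpu_offload()
image = decoder(
    image_embeddings=prior_output.image_embeddings,
    prompt=prompt,
    num_inference_steps=10,
    guidance_scale=0.0,
).images[0]
image.save("output.png")
```

Because the two stages never coexist in VRAM, peak usage is bounded by the larger stage rather than the sum of both, which is what makes the ~11 GB figure possible.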
Very nice
There should be presets (high VRAM, low VRAM, etc.) in case the user's GPU can handle keeping everything loaded 👍
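A rough sketch of what such presets could look like; the threshold, names, and options here are hypothetical, not taken from this repo:

```python
import torch

# Hypothetical presets: thresholds and option names are illustrative only.
PRESETS = {
    "low":  {"dtype": torch.bfloat16, "cpu_offload": True,  "clear_cache": True},
    "high": {"dtype": torch.bfloat16, "cpu_offload": False, "clear_cache": False},
}

def pick_preset() -> str:
    """Pick a preset based on the total VRAM of the active CUDA device."""
    if not torch.cuda.is_available():
        return "low"
    total_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    return "high" if total_gb >= 16 else "low"
```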
Why are you using bf16? From my research this causes slower speeds and is often used for training instead. Also, why are you offloading to the CPU? And why are you freeing CUDA memory each time? This just makes the model reload every time. I may be completely wrong (I don't know much about AI), but I find it strange. Can you explain this to me? How much VRAM does this model use, 12 GB? Maybe it's using too much VRAM. I only have a 3060 with 12 GB, so maybe it's spilling into swap/system RAM, and that's why it's slower than SDXL.
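One way to check whether VRAM is the bottleneck is to read PyTorch's peak-allocation counter around one generation. This is a generic torch snippet, not code from this repo:

```python
import torch

torch.cuda.reset_peak_memory_stats()

# ... run one full prior + decoder generation here ...

peak_gb = torch.cuda.max_memory_allocated() / 1024**3
print(f"Peak VRAM allocated by PyTorch: {peak_gb:.2f} GB")
# If this is close to the card's 12 GB, the driver may be spilling into
# shared system memory, which would explain the slowdown vs. SDXL.
```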