Add Max Token Limit for Generation #1078
Can you elaborate on that? What behavior is different if you set the max toks per sec? |
The model cannot decode faster than the maximum tokens per second. |
I meant when you say "avoids thermal throttling" what are you referring to and how do you detect that it is being "avoided"? |
When power draw stays too high, laptops with M-series chips drop to very low performance. Users can manually lower the model's throughput to prevent this. |
When you say drop to very low performance what does that look like? I’m just trying to understand what’s happening here because maybe there is a deeper issue and manually sleeping in the generation loop could be suboptimal. |
Has this never happened to you? Set up a long generation, it draws ~30W, and then the computer overheats and drops to like ~1W of power draw for 2 minutes to cool down. Throttling helps. |
🤔 no it hasn't. I'd like to reproduce it, roughly how long of a generation with what size model do you experience that? |
Does that happen during the generation? Then it slows down? |
Yes! It does - have you not experienced it?? I can provide a video! (MLX generation for more than ~30 seconds at full throttle results in my 14-inch M3 Max throttling itself so aggressively the screen stutters) |
This works for ANY model btw - as long as the computer is running full throttle! |
Thoughts? |
I just ran this (with no stop condition):
It generated 30k tokens in about 11 mins. The fan was going full speed and the power draw was consistently 30-35 watts on an M3 max. Here's the stats:
|
Now I'm wondering what we are doing differently? |
Is it a 14-inch or 16-inch M3 Max? |
16-inch. MLX on main. |
I'm wondering how much RAM you have? Maybe it's swapping and that's what accounts for the cliff? |
64GB, 14-inch M3 Max, MLX LM (pretty much the latest version). It's thermal throttling that occurs in smaller Macs! |
I ran your exact thing on my 14-inch. It has now dropped to ~1.8W and is stuttering horribly as it desperately tries to cool down. |
Huh, so what happens if you try to train on it? Does it hit the same perf cliff? |
Oh training does the exact same thing - LORAing always makes the computer throttle brutally. |
Could you share some rough numbers on toks/sec pre and post throttling? |
I'm still waiting for the benchmark to complete. It runs at the same tok/sec you report when non-throttled. But I'll report the avg tok/sec on the 30000 tok generation task. |
Thanks! Also curious for LoRA fine-tuning if you have anything readily available. No worries if not. |
Here's the benchmark: Prompt: 15 tokens, 138.944 tokens-per-sec. (The slowdown on LoRA is similar: roughly a third of the max throughput.) |
So yeah - do you think this would be a welcome change? |
Let’s keep the PR open for now. I’m not done investigating this yet. We may or may not merge it, depending on what we find, but I appreciate you helping us figure out the underlying issue. |
Makes sense! Thanks for your openness in this investigation :D |
Example of what happens during LoRA (or in this case full) finetuning of SmolLM2 135M: Iter 560: Train loss 2.217, Learning Rate 3.000e-05, It/sec 2.906, Tokens/sec 1507.738, Trained Tokens 391391, Peak mem 9.893 GB |
I confirm that the 16" is not affected with the fan at max speed, while the 14" is really impacted. This holds for both M3 Max and M4 Max models. Slowing down generation can help reduce the temperature. |
Not true, the 16" MBP is impacted too. Being able to slow down MLX would help avoid throttling and keep the Mac quieter. |
Allows throttling the generation process to a maximum number of tok/sec, so the user can limit how much of their GPU's capacity goes into LLM generation. This helps avoid thermal throttling.
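For readers following the thread, here is a minimal sketch of the general idea of capping tok/sec by sleeping in the generation loop. It is not the code from this PR and does not use the MLX LM API: the names `throttle` and `max_tokens_per_sec`, and the injectable `clock` and `sleep` parameters, are all hypothetical and chosen only for illustration. It wraps any iterable of tokens and delays each one so the stream never exceeds the requested rate.

```python
import time
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")


def throttle(
    tokens: Iterable[T],
    max_tokens_per_sec: float,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[T]:
    """Yield items from `tokens` no faster than `max_tokens_per_sec`.

    Hypothetical helper: `clock` and `sleep` are injectable only so the
    pacing logic can be exercised without real waiting.
    """
    if max_tokens_per_sec <= 0:
        raise ValueError("max_tokens_per_sec must be positive")

    min_interval = 1.0 / max_tokens_per_sec
    next_allowed = clock()  # earliest time the next token may be emitted

    for tok in tokens:
        delay = next_allowed - clock()
        if delay > 0:
            # The producer was faster than the cap: idle until the slot opens.
            sleep(delay)
        # Schedule from whichever is later, so a slow producer does not
        # "bank" unused time and then emit a burst above the cap.
        next_allowed = max(next_allowed, clock()) + min_interval
        yield tok
```

Usage would look like `for tok in throttle(token_stream, 20.0): ...`, where `token_stream` stands in for whatever generator produces tokens. Note that sleeping only lowers the average load; whether that is enough to avoid the thermal cliff discussed above would still need to be measured on the affected machines.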