
Add Max Token Limit for Generation #1078

Open: wants to merge 2 commits into base: main

N8python
Contributor

Allows throttling generation to a maximum number of tokens per second, so the user can control how much of their GPU power goes into LLM generation. Avoids thermal throttling.
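
A minimal sketch of the idea, not the code in this PR: wrap whatever iterator produces tokens and sleep just enough that tokens are released no faster than the cap. The name `throttle` and its `clock`/`sleep` parameters are hypothetical and exist only for this illustration; the sketch does not call any MLX LM API.

```python
import time
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")


def throttle(
    tokens: Iterable[T],
    max_tokens_per_sec: float,
    clock: Callable[[], float] = time.perf_counter,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[T]:
    """Yield items from `tokens` no faster than `max_tokens_per_sec`.

    Hypothetical helper: `clock` and `sleep` are injectable so the pacing
    can be tested without waiting in real time.
    """
    if max_tokens_per_sec <= 0:
        raise ValueError("max_tokens_per_sec must be positive")
    min_interval = 1.0 / max_tokens_per_sec
    start = clock()
    for n, token in enumerate(tokens):
        # Token n may not be released before start + n * min_interval.
        # Pacing against a fixed schedule means a token that was slow to
        # compute does not get an extra sleep on top of its own latency.
        delay = start + n * min_interval - clock()
        if delay > 0:
            sleep(delay)
        yield token
```

In a generation loop this would look like `for tok in throttle(token_iter, 20): ...`, where `token_iter` is a placeholder for the underlying token generator. The GPU sits idle during each sleep, which is what lowers the average power draw.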

@awni
Member

awni commented Oct 31, 2024

> Avoids thermal throttling.

Can you elaborate on that? What behavior is different if you set the max toks per sec?

@N8python
Contributor Author

The model cannot decode faster than the maximum tokens per second.

@awni
Member

awni commented Oct 31, 2024

I meant when you say "avoids thermal throttling" what are you referring to and how do you detect that it is being "avoided"?

@N8python
Contributor Author

When too much power is drawn, laptops with M-series chips drop to very low performance. Users can manually lower the model's throughput to prevent this.

@awni
Member

awni commented Oct 31, 2024

When you say drop to very low performance what does that look like? I’m just trying to understand what’s happening here because maybe there is a deeper issue and manually sleeping in the generation loop could be suboptimal.

@N8python
Contributor Author

Has this never happened to you? Set up a long generation, it draws ~30W, and then the computer overheats and drops to like ~1W of power draw for 2 minutes to cool down. Throttling helps.

@awni
Member

awni commented Oct 31, 2024

> Set up a long generation, it draws ~30W, and then the computer overheats and drops to like ~1W of power draw for 2 minutes to cool down.

🤔 no it hasn't. I'd like to reproduce it, roughly how long of a generation with what size model do you experience that?

@awni
Member

awni commented Oct 31, 2024

> the computer overheats and drops to like ~1W of power draw

Does that happen during the generation? Then it slows down?

@N8python
Contributor Author

Yes! It does - have you not experienced it?? I can provide a video!

(MLX generation for more than ~30 seconds at full throttle results in my 14-inch M3 Max throttling itself so aggressively the screen stutters)

@N8python
Contributor Author

This works for ANY model btw - as long as the computer is running full throttle!

@N8python
Contributor Author

N8python commented Nov 1, 2024

Thoughts?

@awni
Member

awni commented Nov 1, 2024

I just ran this (with no stop condition):

mlx_lm.generate --model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit -m 30000 --prompt "What is generative AI"

It generated 30k tokens in about 11 mins. The fan was going full speed and the power draw was consistently 30-35 watts on an M3 Max.

Here's the stats:

Prompt: 15 tokens, 50.681 tokens-per-sec
Generation: 30000 tokens, 46.960 tokens-per-sec
Peak memory: 8.126 GB

@awni
Member

awni commented Nov 1, 2024

Now I'm wondering what we are doing differently?

@N8python
Contributor Author

N8python commented Nov 1, 2024

Is it a 14-inch or 16-inch M3 Max...?

@awni
Member

awni commented Nov 1, 2024

16 inch
64 GB
OS 15.0.1

MLX on main
MLX LM on main

@awni
Member

awni commented Nov 1, 2024

I'm wondering how much RAM you have? Maybe it's swapping and that's what accounts for the cliff?

@N8python
Contributor Author

N8python commented Nov 1, 2024

64GB, 14-inch M3 Max, MLX LM (pretty much the latest version). It's thermal throttling that occurs in smaller Macs!

@N8python
Contributor Author

N8python commented Nov 1, 2024

I ran your exact command on my 14-inch. It has now dropped to ~1.8W and is stuttering horribly as it desperately tries to cool down.

@awni
Member

awni commented Nov 1, 2024

Huh, so what happens if you try to train on it? Does it hit the same perf cliff?

@N8python
Contributor Author

N8python commented Nov 1, 2024

Oh training does the exact same thing - LoRAing always makes the computer throttle brutally.

@awni
Member

awni commented Nov 1, 2024

> Oh training does the exact same thing - LoRAing always makes the computer throttle brutally.

Could you share some rough numbers on toks/sec pre and post throttling?

@N8python
Contributor Author

N8python commented Nov 1, 2024

I'm still waiting for the benchmark to complete. It runs at the same tok/sec you report when non-throttled. But I'll report the avg tok/sec on the 30000 tok generation task.

@awni
Member

awni commented Nov 1, 2024

Thanks! Also curious for LoRA fine-tuning if you have anything readily available. No worries if not.

@N8python
Contributor Author

N8python commented Nov 1, 2024

Here's the benchmark:

Prompt: 15 tokens, 138.944 tokens-per-sec
Generation: 30000 tokens, 16.352 tokens-per-sec
Peak memory: 8.346 GB

(The slowdown on LoRA is similar - throughput drops to roughly 1/3 of the max)
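
For scale, the two generation benchmarks reported in this thread can be turned into wall-clock times. This is only arithmetic on the posted numbers; the helper name `wall_time_minutes` is made up for the illustration. Note that the two runs come from different machines (16-inch vs 14-inch), and the 14-inch figure is an average that includes the time before throttling set in.

```python
def wall_time_minutes(num_tokens: int, tokens_per_sec: float) -> float:
    """Time to generate `num_tokens` at a constant average rate, in minutes."""
    return num_tokens / tokens_per_sec / 60.0


# Average generation rates posted in this thread for the same 30000-token run.
rate_16_inch = 46.960  # tokens-per-sec, no throttling observed
rate_14_inch = 16.352  # tokens-per-sec, average including the throttled phase

minutes_16_inch = wall_time_minutes(30000, rate_16_inch)
minutes_14_inch = wall_time_minutes(30000, rate_14_inch)
ratio = rate_14_inch / rate_16_inch

print(f"16-inch: {minutes_16_inch:.1f} min")
print(f"14-inch: {minutes_14_inch:.1f} min")
print(f"14-inch average rate is {ratio:.0%} of the 16-inch rate")
```

The first figure matches the "about 11 mins" reported earlier, and the ratio lands close to the "roughly 1/3" mentioned for LoRA.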

@N8python
Contributor Author

N8python commented Nov 2, 2024

So yeah - do you think this would be a welcome change?

@awni
Member

awni commented Nov 2, 2024

Let’s keep the PR open for now. I’m not done investigating this yet. We may or may not merge it depending on what we find, but I appreciate you helping us figure out the underlying issue.

@N8python
Contributor Author

N8python commented Nov 2, 2024

Makes sense! Thanks for your openness in this investigation :D

@N8python
Contributor Author

N8python commented Nov 2, 2024

Example of what happens during LoRA (or in this case full) fine-tuning of SmolLM2 135M:

Iter 560: Train loss 2.217, Learning Rate 3.000e-05, It/sec 2.906, Tokens/sec 1507.738, Trained Tokens 391391, Peak mem 9.893 GB
Iter 570: Train loss 2.091, Learning Rate 3.000e-05, It/sec 1.618, Tokens/sec 1547.780, Trained Tokens 400957, Peak mem 9.893 GB
Iter 580: Train loss 1.820, Learning Rate 3.000e-05, It/sec 1.877, Tokens/sec 1077.392, Trained Tokens 406696, Peak mem 9.893 GB
Iter 590: Train loss 1.919, Learning Rate 3.000e-05, It/sec 1.490, Tokens/sec 1374.364, Trained Tokens 415923, Peak mem 9.893 GB
Iter 600: Train loss 2.071, Learning Rate 3.000e-05, It/sec 2.043, Tokens/sec 1326.902, Trained Tokens 422418, Peak mem 9.893 GB
Iter 600: Saved adapter weights to adapters/adapters.safetensors and adapters/0000600_adapters.safetensors.
Iter 610: Train loss 2.017, Learning Rate 3.000e-05, It/sec 1.077, Tokens/sec 791.562, Trained Tokens 429771, Peak mem 9.893 GB
Iter 620: Train loss 2.397, Learning Rate 3.000e-05, It/sec 1.575, Tokens/sec 985.628, Trained Tokens 436027, Peak mem 9.893 GB
Iter 630: Train loss 2.019, Learning Rate 3.000e-05, It/sec 1.401, Tokens/sec 817.147, Trained Tokens 441861, Peak mem 9.893 GB
Iter 640: Train loss 1.931, Learning Rate 3.000e-05, It/sec 1.037, Tokens/sec 708.138, Trained Tokens 448692, Peak mem 9.893 GB
Iter 650: Train loss 2.295, Learning Rate 3.000e-05, It/sec 0.938, Tokens/sec 483.847, Trained Tokens 453853, Peak mem 9.893 GB
Iter 660: Train loss 1.740, Learning Rate 3.000e-05, It/sec 0.540, Tokens/sec 368.923, Trained Tokens 460687, Peak mem 9.893 GB
Iter 670: Train loss 1.884, Learning Rate 3.000e-05, It/sec 0.218, Tokens/sec 176.933, Trained Tokens 468785, Peak mem 9.893 GB
Iter 680: Train loss 2.026, Learning Rate 3.000e-05, It/sec 0.264, Tokens/sec 206.046, Trained Tokens 476577, Peak mem 9.893 GB
Iter 690: Train loss 2.112, Learning Rate 3.000e-05, It/sec 0.230, Tokens/sec 211.577, Trained Tokens 485780, Peak mem 9.893 GB
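
To put a number on the drop in the log above, a small sketch like the following could parse the `Tokens/sec` field and compare the first and last reports. The regular expression is an assumption based only on the line format shown here, and `parse_throughput` is a hypothetical helper, not part of MLX LM.

```python
import re

# Matches progress lines such as
# "Iter 560: Train loss 2.217, ..., It/sec 2.906, Tokens/sec 1507.738, ..."
# Lines without these fields (e.g. "Saved adapter weights") are skipped.
LINE_RE = re.compile(r"Iter (\d+): .*?It/sec ([\d.]+), Tokens/sec ([\d.]+)")


def parse_throughput(log: str) -> list[tuple[int, float, float]]:
    """Return (iteration, it_per_sec, tokens_per_sec) for each progress line."""
    rows = []
    for line in log.splitlines():
        m = LINE_RE.search(line)
        if m:
            rows.append((int(m.group(1)), float(m.group(2)), float(m.group(3))))
    return rows


# Three lines copied from the log above.
SAMPLE_LOG = """\
Iter 560: Train loss 2.217, Learning Rate 3.000e-05, It/sec 2.906, Tokens/sec 1507.738, Trained Tokens 391391, Peak mem 9.893 GB
Iter 600: Saved adapter weights to adapters/adapters.safetensors and adapters/0000600_adapters.safetensors.
Iter 690: Train loss 2.112, Learning Rate 3.000e-05, It/sec 0.230, Tokens/sec 211.577, Trained Tokens 485780, Peak mem 9.893 GB
"""

rows = parse_throughput(SAMPLE_LOG)
slowdown = rows[0][2] / rows[-1][2]
print(f"Tokens/sec fell by a factor of {slowdown:.1f} between "
      f"iter {rows[0][0]} and iter {rows[-1][0]}")
```

Between iter 560 and iter 690 the reported throughput falls by roughly 7x.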

@ivanfioravanti
Contributor

I confirm that the 16" is not affected with the fan at max speed, while the 14" is really impacted. This holds for both M3 Max and M4 Max models. Slowing down generation can help reduce the temperature.

@ivanfioravanti
Contributor

Not true, the MBP 16" is impacted too. Being able to slow down MLX would help avoid throttling and keep the Mac less noisy.


3 participants