Add Max Token Limit for Generation #1078

N8python · 2024-10-31T06:44:12Z

Allows throttling the generation process to a maximum number of tokens per second, so the user can control what percentage of their GPU power goes into LLM generation. Avoids thermal throttling.

awni · 2024-10-31T13:12:05Z

> Avoids thermal throttling.

Can you elaborate on that? What behavior is different if you set the max toks per sec?

N8python · 2024-10-31T13:49:28Z

The model cannot decode faster than the maximum tokens per second.
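A minimal sketch of what such a cap could look like, assuming generation is exposed as a plain Python iterator of tokens. The name `throttle_tokens` and its `clock`/`sleep` parameters are hypothetical and are not taken from this PR's diff; the idea is only to sleep whenever the stream runs ahead of the allowed rate.

```python
import time
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")


def throttle_tokens(
    tokens: Iterable[T],
    max_tokens_per_sec: float,
    clock: Callable[[], float] = time.perf_counter,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[T]:
    """Yield items from `tokens` no faster than `max_tokens_per_sec`.

    Hypothetical helper for illustration; not the code in this PR.
    """
    min_interval = 1.0 / max_tokens_per_sec
    start = clock()
    for n, token in enumerate(tokens):
        # Token n is due no earlier than start + n * min_interval.
        wait = start + n * min_interval - clock()
        if wait > 0:
            sleep(wait)  # the GPU is idle while we wait
        yield token
```

Wrapping a token stream as `throttle_tokens(stream, 20.0)` would hold it to at most 20 tokens per second on average. Because the schedule is measured from the start of the stream, a slow stretch can be followed by a short burst while the stream catches up; enforcing a minimum gap between consecutive tokens instead would avoid that.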

awni · 2024-10-31T13:54:03Z

I meant when you say "avoids thermal throttling" what are you referring to and how do you detect that it is being "avoided"?

N8python · 2024-10-31T15:39:13Z

When too much power is drawn, laptops with M-series chips drop to very low performance. Users can manually lower the model's throughput to prevent this.

awni · 2024-10-31T15:46:27Z

When you say drop to very low performance what does that look like? I’m just trying to understand what’s happening here because maybe there is a deeper issue and manually sleeping in the generation loop could be suboptimal.

N8python · 2024-10-31T16:04:43Z

Has this never happened to you? Set up a long generation, it draws ~30W, and then the computer overheats and drops to like ~1W of power draw for 2 minutes to cool down. Throttling helps.

awni · 2024-10-31T16:33:14Z

> Set up a long generation, it draws ~30W, and then the computer overheats and drops to like ~1W of power draw for 2 minutes to cool down.

🤔 no it hasn't. I'd like to reproduce it, roughly how long of a generation with what size model do you experience that?

awni · 2024-10-31T16:33:57Z

> the computer overheats and drops to like ~1W of power draw

Does that happen during the generation? Then it slows down?

N8python · 2024-10-31T16:50:30Z

Yes! It does - have you not experienced it?? I can provide a video!

(MLX generation for more than ~30 seconds at full throttle results in my 14-inch M3 Max throttling itself so aggressively the screen stutters)

N8python · 2024-10-31T16:51:28Z

This works for ANY model btw - as long as the computer is running full throttle!

N8python · 2024-11-01T07:13:32Z

Thoughts?

awni · 2024-11-01T19:37:18Z

I just ran this (with no stop condition):

mlx_lm.generate --model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit -m 30000 --prompt "What is generative AI"

It generated 30k tokens in about 11 mins. The fan was going full speed and the power draw was consistently 30-35 watts on an M3 Max.

Here's the stats:

Prompt: 15 tokens, 50.681 tokens-per-sec
Generation: 30000 tokens, 46.960 tokens-per-sec
Peak memory: 8.126 GB

awni · 2024-11-01T19:38:22Z

Now I'm wondering what we are doing differently?

N8python · 2024-11-01T19:51:05Z

Is it a 14-inch or 16-inch M3 Max?

awni · 2024-11-01T19:53:41Z

16 inch
64 GB
OS 15.0.1

MLX on main
MLX LM on main

awni · 2024-11-01T19:54:04Z

I'm wondering how much RAM you have? Maybe it's swapping and that's what accounts for the cliff?

N8python · 2024-11-01T19:56:32Z

64GB, 14-inch M3 Max, MLX LM (pretty much the latest version). It's thermal throttling that occurs in smaller Macs!

N8python · 2024-11-01T20:00:49Z

I ran your exact command on my 14-inch. It has now dropped to ~1.8W and is stuttering horribly as it desperately tries to cool down.

awni · 2024-11-01T20:05:58Z

Huh, so what happens if you try to train on it? Does it hit the same perf cliff?

N8python · 2024-11-01T20:15:05Z

Oh training does the exact same thing - LORAing always makes the computer throttle brutally.

awni · 2024-11-01T20:19:18Z

> Oh training does the exact same thing - LORAing always makes the computer throttle brutally.

Could you share some rough numbers on toks/sec pre and post throttling?

N8python · 2024-11-01T20:20:35Z

I'm still waiting for the benchmark to complete. It runs at the same tok/sec you report when non-throttled. But I'll report the avg tok/sec on the 30000 tok generation task.

awni · 2024-11-01T20:22:33Z

Thanks! Also curious for LoRA fine-tuning if you have anything readily available. No worries if not.

N8python · 2024-11-01T20:31:07Z

Here's the benchmark:

Prompt: 15 tokens, 138.944 tokens-per-sec
Generation: 30000 tokens, 16.352 tokens-per-sec
Peak memory: 8.346 GB

(The slowdown with LoRA is similar: roughly a third of the max throughput.)
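To put the two 30000-token runs side by side, here is a small sketch that uses only the generation figures reported in this thread (46.960 tokens-per-sec on the 16-inch, 16.352 on the 14-inch). The variable and function names are made up for illustration.

```python
# Figures copied from the two benchmark reports in this thread.
TOKENS = 30_000
RATE_16_INCH = 46.960  # tokens-per-sec, 16-inch M3 Max
RATE_14_INCH = 16.352  # tokens-per-sec, 14-inch M3 Max, averaged over the throttled run


def minutes(tokens: int, tokens_per_sec: float) -> float:
    """Wall-clock minutes needed to generate `tokens` at a steady rate."""
    return tokens / tokens_per_sec / 60


ratio = RATE_14_INCH / RATE_16_INCH
print(f"14-inch / 16-inch throughput: {ratio:.2f}")
print(f"16-inch run: {minutes(TOKENS, RATE_16_INCH):.1f} min")
print(f"14-inch run: {minutes(TOKENS, RATE_14_INCH):.1f} min")
```

The ratio comes out at about 0.35, in line with the "roughly a third" estimate, and the 16-inch time of roughly 10.6 minutes agrees with the "about 11 mins" reported earlier. The same generation takes about 30.6 minutes on the 14-inch.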

N8python · 2024-11-02T01:46:52Z

So yeah - do you think this would be a welcome change?

awni · 2024-11-02T02:13:11Z

Let's keep the PR open for now. I'm not done investigating this yet. We may or may not merge it depending on what we find, but I appreciate you helping us figure out the underlying issue.

N8python · 2024-11-02T16:27:01Z

Makes sense! Thanks for your openness in this investigation :D

N8python · 2024-11-02T20:03:19Z

Example of what happens during LoRA (or in this case full) fine-tuning of SmolLM2 135M:

Iter 560: Train loss 2.217, Learning Rate 3.000e-05, It/sec 2.906, Tokens/sec 1507.738, Trained Tokens 391391, Peak mem 9.893 GB
Iter 570: Train loss 2.091, Learning Rate 3.000e-05, It/sec 1.618, Tokens/sec 1547.780, Trained Tokens 400957, Peak mem 9.893 GB
Iter 580: Train loss 1.820, Learning Rate 3.000e-05, It/sec 1.877, Tokens/sec 1077.392, Trained Tokens 406696, Peak mem 9.893 GB
Iter 590: Train loss 1.919, Learning Rate 3.000e-05, It/sec 1.490, Tokens/sec 1374.364, Trained Tokens 415923, Peak mem 9.893 GB
Iter 600: Train loss 2.071, Learning Rate 3.000e-05, It/sec 2.043, Tokens/sec 1326.902, Trained Tokens 422418, Peak mem 9.893 GB
Iter 600: Saved adapter weights to adapters/adapters.safetensors and adapters/0000600_adapters.safetensors.
Iter 610: Train loss 2.017, Learning Rate 3.000e-05, It/sec 1.077, Tokens/sec 791.562, Trained Tokens 429771, Peak mem 9.893 GB
Iter 620: Train loss 2.397, Learning Rate 3.000e-05, It/sec 1.575, Tokens/sec 985.628, Trained Tokens 436027, Peak mem 9.893 GB
Iter 630: Train loss 2.019, Learning Rate 3.000e-05, It/sec 1.401, Tokens/sec 817.147, Trained Tokens 441861, Peak mem 9.893 GB
Iter 640: Train loss 1.931, Learning Rate 3.000e-05, It/sec 1.037, Tokens/sec 708.138, Trained Tokens 448692, Peak mem 9.893 GB
Iter 650: Train loss 2.295, Learning Rate 3.000e-05, It/sec 0.938, Tokens/sec 483.847, Trained Tokens 453853, Peak mem 9.893 GB
Iter 660: Train loss 1.740, Learning Rate 3.000e-05, It/sec 0.540, Tokens/sec 368.923, Trained Tokens 460687, Peak mem 9.893 GB
Iter 670: Train loss 1.884, Learning Rate 3.000e-05, It/sec 0.218, Tokens/sec 176.933, Trained Tokens 468785, Peak mem 9.893 GB
Iter 680: Train loss 2.026, Learning Rate 3.000e-05, It/sec 0.264, Tokens/sec 206.046, Trained Tokens 476577, Peak mem 9.893 GB
Iter 690: Train loss 2.112, Learning Rate 3.000e-05, It/sec 0.230, Tokens/sec 211.577, Trained Tokens 485780, Peak mem 9.893 GB
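One way to quantify the drop in a log like this is to pull out the `Tokens/sec` field and compare the final rate with the peak. The sketch below does that for a subset of the lines above; the regular expression and the helper name `parse_throughput` are assumptions made for illustration and are not part of MLX LM.

```python
import re

# A subset of the training log lines quoted above, copied verbatim.
LOG = """\
Iter 560: Train loss 2.217, Learning Rate 3.000e-05, It/sec 2.906, Tokens/sec 1507.738, Trained Tokens 391391, Peak mem 9.893 GB
Iter 570: Train loss 2.091, Learning Rate 3.000e-05, It/sec 1.618, Tokens/sec 1547.780, Trained Tokens 400957, Peak mem 9.893 GB
Iter 600: Train loss 2.071, Learning Rate 3.000e-05, It/sec 2.043, Tokens/sec 1326.902, Trained Tokens 422418, Peak mem 9.893 GB
Iter 600: Saved adapter weights to adapters/adapters.safetensors and adapters/0000600_adapters.safetensors.
Iter 650: Train loss 2.295, Learning Rate 3.000e-05, It/sec 0.938, Tokens/sec 483.847, Trained Tokens 453853, Peak mem 9.893 GB
Iter 690: Train loss 2.112, Learning Rate 3.000e-05, It/sec 0.230, Tokens/sec 211.577, Trained Tokens 485780, Peak mem 9.893 GB
"""

# Matches only the per-iteration report lines, not the "Saved adapter" line.
PATTERN = re.compile(r"Iter (\d+): .*Tokens/sec ([\d.]+)")


def parse_throughput(log: str) -> list[tuple[int, float]]:
    """Return (iteration, tokens_per_sec) pairs found in the log text."""
    rows = []
    for line in log.splitlines():
        match = PATTERN.search(line)
        if match:
            rows.append((int(match.group(1)), float(match.group(2))))
    return rows


rows = parse_throughput(LOG)
peak = max(rate for _, rate in rows)
last_iter, last_rate = rows[-1]
print(f"peak: {peak:.1f} tok/s, iter {last_iter}: {last_rate:.1f} tok/s")
print(f"final rate is {last_rate / peak:.0%} of peak")
```

On these lines the final rate is about 14% of the peak, so this excerpt ends well below the "roughly a third" average seen in the generation benchmark.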

ivanfioravanti · 2024-11-22T23:07:08Z

I can confirm that the 16" is not affected with the fan at max speed, while the 14" is really impacted. This holds for both M3 Max and M4 Max models. Slowing down generation can help reduce the temperature.

ivanfioravanti · 2024-12-15T16:31:36Z

Not true, the MBP 16" is impacted too. Being able to slow down MLX would help avoid throttling and keep the Mac less noisy.

N8 added 2 commits October 31, 2024 02:20

add max token limit

7e4413b

smol modification

e6d3530
