[WIP] Add a speculative decoding generator #1155
Benchmarks on M3 Max:

Baseline:

```
mlx_lm.generate --model mlx-community/Qwen2.5-32B-Instruct-4bit --prompt "Write a quick sort in C++" -m 256
```

```
Prompt: 36 tokens, 86.936 tokens-per-sec
Generation: 256 tokens, 19.680 tokens-per-sec
Peak memory: 18.573 GB
```

With speculative decoding:

```
Prompt: 36 tokens, 87.853 tokens-per-sec
Generation: 256 tokens, 35.738 tokens-per-sec
Peak memory: 19.112 GB
```
The outputs are identical.
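For reference, the speculative run above might be invoked along these lines. The `--draft-model` and `--num-draft-tokens` flags are assumptions about how this PR exposes the feature, as is the choice of draft model:

```
mlx_lm.generate --model mlx-community/Qwen2.5-32B-Instruct-4bit \
  --draft-model mlx-community/Qwen2.5-0.5B-Instruct-4bit \
  --num-draft-tokens 4 \
  --prompt "Write a quick sort in C++" -m 256
```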
A note on the implementation: it seemed simpler, to start, to have a separate `speculative_generate_step` rather than trying to merge everything into the existing generator. I might refactor a bit so the two can share more functionality. I'm also not sold on wiring this through `stream_generate`. It could start as a standalone thing that either builds on top of MLX LM or is more self-contained. Let me know your thoughts, if any.
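For concreteness, here's a minimal sketch of the draft-and-verify loop that a separate `speculative_generate_step` could implement, assuming greedy decoding (which is what makes the outputs match the baseline exactly). The `Model` callable interface is a hypothetical stand-in for illustration, not the actual MLX LM API:

```python
from typing import Callable, List

# Hypothetical stand-in: a "model" maps a token sequence to the argmax
# next token (greedy decoding). Not the real MLX LM interface.
Model = Callable[[List[int]], int]

def speculative_generate_step(
    tokens: List[int],
    target: Model,
    draft: Model,
    num_draft: int = 4,
) -> List[int]:
    """Propose `num_draft` tokens with the draft model, then verify them
    with the target model. Each step emits between 1 and num_draft + 1
    tokens: the accepted prefix of the draft plus one target token."""
    # 1. Draft: the cheap model proposes a short continuation.
    proposed: List[int] = []
    ctx = list(tokens)
    for _ in range(num_draft):
        t = draft(ctx)
        proposed.append(t)
        ctx.append(t)

    # 2. Verify: with greedy decoding, a proposal is accepted iff the
    #    target would have produced the same token. The first mismatch
    #    is replaced by the target's own token and the rest discarded.
    accepted: List[int] = []
    ctx = list(tokens)
    for t in proposed:
        target_t = target(ctx)
        if target_t == t:
            accepted.append(t)
            ctx.append(t)
        else:
            accepted.append(target_t)  # target's correction
            return accepted

    # All proposals accepted; emit one bonus token from the target.
    accepted.append(target(ctx))
    return accepted
```

In a real implementation the verification runs all drafted positions through the target model in a single batched forward pass, which is where the speedup comes from; the per-token verify loop here is only for readability.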