feat(mlx_lm): support batch input in `generate()` #948

llllvvuu · 2024-08-21T05:08:23Z

The prompt argument can now be either a str or list[str].

This is based on @willccbb's implementation at https://github.com/willccbb/mlx_parallm; I noticed that it aligned with the KVCache upgrades in #911.

The change to generate() is backwards-compatible.

The changes to generate_step(), top_p_sampling(), and min_p_sampling() are backwards-incompatible in order to unify shapes; this could be changed by adding a few if-statements, if preferred.

llllvvuu · 2024-08-26T05:48:38Z

Kind of interesting: for quantized models, the throughput is doesn't go up a lot between small bs (bs=1,2,3,4), but then it starts to go up a lot at higher bs, which is the opposite of what I expected intuitively. For unquantized models the throughput does goes up between small bs. I observe the same on @willccbb's original repo.

The `prompt` argument can now be either a `str` or `list[str]`. The change to `generate()` is backwards-compatible. The changes to `generate_step()`, `top_p_sampling()`, and `min_p_sampling()` are backwards-incompatible in order to unify shapes; this could be changed by adding a few if-statements, if preferred.

awni · 2024-08-29T13:24:14Z

llms/mlx_lm/utils.py

-    prompt_tokens = mx.array(tokenizer.encode(prompt))
-    detokenizer = tokenizer.detokenizer
+    if is_batch:
+        tokenizer._tokenizer.padding_side = "left"


I see that we left pad shorter prompts here which makes sense. But one thing that I'm wondering is how this is handled in the causal models if at all? Shouldn't the causal mask take into account the padding?

I didn't handle it, generation seems OK without it but indeed to be correct I should consume the tokenizer._tokenizer(prompt, padding=True)["attention_mask"]. To do this I would need to update our model APIs to have attention_mask as an input similar to how transformers has model.generate taking attention_mask. Probably this is involves hitting every file in models/. Should mostly be copy/paste though. I can look into it.

awni · 2024-08-29T13:40:05Z

I think it makes sense to minimize the complexity to the generate function (which is becoming a bit spaghetti) to split out the batched generation into a separate function called batch_generate. I would simplify that function to have fewer arguments (like no formatter, no printing during generation, verbose only prints the timings (e.g. as you have it now).

Also maybe more tricky is the fact that I think for this to be correct, the causal masks need to consider the left padding in the input (please correct me if I'm wrong about that). This has two implications:

Probably we'd need to add a mask parameter to the model __call__ functions and provide an appropriately constructed mask for the batch case.
The Rotating KV cache will be broken in this case (it keeps the initial tokens which would be the padded tokens) and when rotates the mask would need to be updated to consider the padding (which is a bit complicated/tedious). In this case I may suggest disabling this option entirely..

Let me know what you think about the above.

llllvvuu · 2024-08-29T13:53:52Z

I think it makes sense to minimize the complexity to the generate function (which is becoming a bit spaghetti) to split out the batched generation into a separate function called batch_generate. I would simplify that function to have fewer arguments (like no formatter, no printing during generation, verbose only prints the timings (e.g. as you have it now).

Makes sense to me, will implement.

Also maybe more tricky is the fact that I think for this to be correct, the causal masks need to consider the left padding in the input (please correct me if I'm wrong about that). This has two implications:

Probably we'd need to add a mask parameter to the model __call__ functions and provide an appropriately constructed mask for the batch case.

Yes, this sounds straightforward enough.

The Rotating KV cache will be broken in this case (it keeps the initial tokens which would be the padded tokens) and when rotates the mask would need to be updated to consider the padding (which is a bit complicated/tedious). In this case I may suggest disabling this option entirely..

I'll do a bit of thinking if there's an easy way to handle this, otherwise I'll remove that parameter in batch_generate.

Will update when these changes are ready!

awni · 2024-09-27T19:32:02Z

@llllvvuu are you coming back to this?

llllvvuu · 2024-09-28T00:05:58Z

hey @awni , sorry for the delay, I'd been job hunting this month. I should be able to get back to this in ~a week

awni · 2024-09-28T00:18:41Z

No worries, just checking. I'll follow up in a week or so.

nath1295 · 2024-10-15T21:47:27Z

Just realised the attention mask has been mentioned in this PR, which is the reason I raised this issue #1044

llllvvuu changed the title ~~feat: support batch input in generate()~~ feat(mlx_lm): support batch input in generate() Aug 21, 2024

llllvvuu force-pushed the feat/batch_generate branch from 7332759 to 332a713 Compare August 21, 2024 05:20

llllvvuu marked this pull request as draft August 21, 2024 05:22

llllvvuu force-pushed the feat/batch_generate branch from 332a713 to 12c6066 Compare August 21, 2024 05:25

llllvvuu marked this pull request as ready for review August 21, 2024 05:25

llllvvuu force-pushed the feat/batch_generate branch from 12c6066 to ef92993 Compare August 21, 2024 05:45

llllvvuu mentioned this pull request Aug 26, 2024

feat(mlx_lm): basic speculative decoding support in mlx_lm.generate / mlx_lm.server #954

Closed

llllvvuu added 2 commits August 29, 2024 05:13

feat: show batch generation progress

2caa832

llllvvuu force-pushed the feat/batch_generate branch from 5105b31 to 2caa832 Compare August 29, 2024 12:15

awni reviewed Aug 29, 2024

View reviewed changes

Merge branch 'main' into feat/batch_generate

8fb82fe

llllvvuu force-pushed the feat/batch_generate branch from bea0c4b to 8fb82fe Compare October 9, 2024 19:13

Merge branch 'main' into feat/batch_generate

9ee726c

llllvvuu force-pushed the feat/batch_generate branch from 308ad24 to 9ee726c Compare October 9, 2024 19:20

chimezie mentioned this pull request Oct 12, 2024

Prompt caching in mlx_lm.server #1026

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(mlx_lm): support batch input in `generate()` #948

feat(mlx_lm): support batch input in `generate()` #948

llllvvuu commented Aug 21, 2024 •

edited

Loading

llllvvuu commented Aug 26, 2024 •

edited

Loading

awni Aug 29, 2024

llllvvuu Aug 29, 2024 •

edited

Loading

awni commented Aug 29, 2024

llllvvuu commented Aug 29, 2024

awni commented Sep 27, 2024

llllvvuu commented Sep 28, 2024

awni commented Sep 28, 2024

nath1295 commented Oct 15, 2024

feat(mlx_lm): support batch input in generate() #948

Are you sure you want to change the base?

feat(mlx_lm): support batch input in generate() #948

Conversation

llllvvuu commented Aug 21, 2024 • edited Loading

llllvvuu commented Aug 26, 2024 • edited Loading

awni Aug 29, 2024

Choose a reason for hiding this comment

llllvvuu Aug 29, 2024 • edited Loading

Choose a reason for hiding this comment

awni commented Aug 29, 2024

llllvvuu commented Aug 29, 2024

awni commented Sep 27, 2024

llllvvuu commented Sep 28, 2024

awni commented Sep 28, 2024

nath1295 commented Oct 15, 2024

feat(mlx_lm): support batch input in `generate()` #948

feat(mlx_lm): support batch input in `generate()` #948

llllvvuu commented Aug 21, 2024 •

edited

Loading

llllvvuu commented Aug 26, 2024 •

edited

Loading

llllvvuu Aug 29, 2024 •

edited

Loading