Prompt caching in mlx_lm.server
#1026
Conversation
This would not be backwards compatible with any later incorporation of batched input for generate (i.e., #948).
Leave the door open for (#948).
@@ -474,15 +531,15 @@ def handle_completion(

     def handle_stream(
         self,
-        prompt: mx.array,
+        prompt: List[int],
This in particular. Can't we handle a single or multiple (batched) prompt that falls back to the behavior for a single prompt by default?
We can always update this type to List[List[int]] when the time comes.
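To illustrate the point about leaving room for batched prompts, here is a minimal sketch (not code from this PR; the normalize_prompts helper name is hypothetical) of how a handler could accept either a single tokenized prompt or a batch and normalize to a batch of one, so later switching the parameter type to List[List[int]] would not break single-prompt callers.

```python
from typing import List, Union

Prompt = List[int]


def normalize_prompts(prompt: Union[Prompt, List[Prompt]]) -> List[Prompt]:
    """Hypothetical helper: accept one tokenized prompt or a batch of them
    and always return a batch, so a single-prompt call keeps its current
    behavior while batched input (#948) could be added later."""
    if prompt and isinstance(prompt[0], int):
        return [prompt]      # single prompt -> batch of one
    return list(prompt)      # already a batch (or empty)
```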
Looks fantastic! I left a few comments that may or may not need addressing.
Added a basic prompt cache in mlx_lm.server for chat mode. It keeps things simple, but it does support chatting with cache reuse.
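As a rough illustration of what chat-mode cache reuse means here (a sketch under assumptions, not this PR's actual implementation; the SimplePromptCache class is hypothetical): the server can remember the tokens it has already processed and, when the next chat request repeats the same conversation prefix, run the model only over the new suffix while keeping the existing KV cache.

```python
from typing import List


class SimplePromptCache:
    """Hypothetical sketch of chat-mode cache reuse: remember the last
    processed prompt's tokens and feed the model only the new suffix."""

    def __init__(self) -> None:
        self.cached_tokens: List[int] = []

    def get_suffix(self, prompt: List[int]) -> List[int]:
        # Length of the shared prefix between cached tokens and the new prompt.
        n = 0
        while (n < len(self.cached_tokens)
               and n < len(prompt)
               and self.cached_tokens[n] == prompt[n]):
            n += 1
        # If the new prompt diverges before the cache ends, a real server
        # would need to trim or reset the KV cache; this sketch just resets
        # its bookkeeping to the new prompt.
        self.cached_tokens = list(prompt)
        # Only the unseen suffix needs a forward pass; the KV cache already
        # covers the first n tokens.
        return prompt[n:]
```

In a chat session each request repeats the prior conversation, so the shared prefix keeps growing and only the latest messages (plus the chat-template tokens around them) need to be processed.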