Add copilot server example #23

Open
wants to merge 7 commits into
base: master

chenhunghan

This PR adds an example HTTP server that wraps exllamav2 and can be used as a replacement for the GitHub Copilot backend.

Signed-off-by: Hung-Han (Henry) Chen <[email protected]>
@19h
Contributor

19h commented Sep 14, 2023

Wow, this is pretty cool 👍

@KaruroChori

I am having issues with cache = ExLlamaV2Cache(model), which fails with:

File "/app/exllamav2/cache.py", line 25, in __init__
    p_key_states = torch.zeros(self.batch_size, self.max_seq_len, num_key_value_heads, head_dim, dtype = torch.float16, device = self.model.cache_map[i])
KeyError: 0

I made some minor modifications to the code so that the model is not downloaded from Hugging Face, and changed the path to match the one in the Docker config I added. Still, I don't think either change caused this.
The model is a 13B Llama 2, which works fine with chat.py.

Has anyone else had success with the original source code in this PR?

@KaruroChori

KaruroChori commented Sep 15, 2023

I found the issue. Your code was missing

model.load()

before cache is reserved.
I need to test if the rest is working, but at least now it does not halt.
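
For reference, the ordering described above can be sketched as follows. This is a minimal sketch, not the code from this PR: the function name load_model_and_cache and the model_dir argument are hypothetical, and it assumes the exllamav2 classes behave as in the library's example scripts from around the time of this thread.

```python
def load_model_and_cache(model_dir):
    """Load an exllamav2 model, then allocate its cache (the order matters)."""
    # Imported lazily so this sketch can be imported without the library installed.
    from exllamav2 import (
        ExLlamaV2,
        ExLlamaV2Cache,
        ExLlamaV2Config,
        ExLlamaV2Tokenizer,
    )

    config = ExLlamaV2Config()
    config.model_dir = model_dir  # hypothetical local path supplied by the caller
    config.prepare()

    model = ExLlamaV2(config)
    # Must run before the cache is created; otherwise the cache constructor
    # fails with the KeyError reported earlier in this thread.
    model.load()

    tokenizer = ExLlamaV2Tokenizer(config)
    cache = ExLlamaV2Cache(model)  # allocates key/value tensors for the loaded model
    return model, tokenizer, cache
```

The traceback shows the cache constructor indexing self.model.cache_map, which is consistent with that mapping being empty until the model has been loaded.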

@chenhunghan
Author

> I found the issue. Your code was missing
>
> model.load()
>
> before cache is reserved. I need to test if the rest is working, but at least now it does not halt.

Thank you :)

Signed-off-by: Hung-Han (Henry) Chen <[email protected]>
@chenhunghan
Author

I have fixed a few bugs; it is now more or less working and has been tested on cloud GPUs.

@SinanAkkoyun
Contributor

That is very cool. How do insertions work here? Copilot is trained to insert code "in the middle", is that also possible with this endpoint or is the only thing it receives the previous code?

@chenhunghan
Author

> That is very cool. How do insertions work here? Copilot is trained to insert code "in the middle", is that also possible with this endpoint or is the only thing it receives the previous code?

Only the previous code.

@Skoolin

Skoolin commented Dec 19, 2023

This is great! Is there a plan to include code insertion / infilling? Code Llama has been trained with infilling, but it would need the special tokens <PRE>, <MID> and <SUF>. Does the Llama tokenizer implementation in exllamav2 support those tokens? If not, I might try to implement them myself, as that would be really useful to me.
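
As an illustration of what such an infilling request would look like, here is a minimal sketch of the commonly used Code Llama infilling prompt layout. The helper name build_infill_prompt is hypothetical, and the exact spacing around the markers is an assumption. The sketch only builds the prompt string; whether a given tokenizer maps these markers to the special token ids is exactly the open question above.

```python
def build_infill_prompt(prefix: str, suffix: str) -> str:
    """Arrange the code before and after the cursor for fill-in-the-middle.

    The model is expected to generate the missing middle after <MID>.
    Assumed layout: "<PRE> {prefix} <SUF>{suffix} <MID>".
    """
    return f"<PRE> {prefix} <SUF>{suffix} <MID>"


# Example: the cursor sits between the function signature and the return.
prompt = build_infill_prompt(
    prefix="def add(a, b):\n    ",
    suffix="\n    return result",
)
```

With a prefix-only endpoint like the one in this PR, only the text before the cursor would be sent, so the suffix would be ignored.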

@chenhunghan
Author

I don't have a plan for that, as this was just for fun. You can definitely add your own; enjoy the hack 😀

5 participants