[REQUEST] Llama 3.2 Vision Support (or already exists?) #658
Comments
Running a forward pass with vision tokens or occasionally running the hidden state through some cross-attention layers out of a HF implementation isn't the difficult part. The problem is all the auxiliary stuff. Where does the image data reside? How do you manage its lifetime? Should you preallocate memory for cross-attention keys/values? How many images do you support, one per context or multiple? Do you need separate implementations for Llava-type (token) models and Llama3.2-type (x-attn) models? Should you support batched generation in which some sequences contain images and some don't? What does that control flow look like? You need some idea of a use case to start answering questions like that, and use cases for image models tend to be hard to grapple with. As in, what is it you're hoping to do, at the end of the day? So far I haven't gotten a good answer to that question other than "I want to try out this VLM stuff." Which is a difficult thing to design an interface around. There is a partial implementation for image tokens, which I'm going to expand upon a bit, and I have some loose ideas for how that could work, API-wise. Of course, Llama3.2 completely breaks all of the assumptions I made in that regard, so I've had to reevaluate a few things. Perhaps I'll just not support it, idk.
Check out how it works in vLLM and openedai-vision (https://github.com/matatonic/openedai-vision). Some of this is solved already, at least in their ways. I am regularly using both image-to-text and inline images in SillyTavern with models that support it. The use case is to show the AI something like a meme, an object to identify (what's this flower), have it do OCR on text, including foreign-language text, etc. It is actually very useful. AI can also see its own image gens and improve them... Someone made an AI navigate screens by using the image input + xy coordinates. I'm not sure it stays in the context past the original message, but it would be interesting if it did (they never refer back to pics). It's not just Llama 3.2 but Pixtral, Qwen-VL, etc.; there is a whole host of vision models out there. You also really shouldn't quantize their layers below Q8, from my experience, because the OCR gets really bad. I would love to use Qwen-VL or a merge of 3.2 with something like Hermes in my regular use; instead I am stuck doing transcription with Florence or having to go back to Gemini. Was going to try vLLM with Qwen2-VL as well, since that's all that supports it for non-transcription use.
I think supporting an extremely basic use case and seeing what people do is not a bad idea, and is better than no support! Batched operation is nice, but once you go there, there are other options like vLLM. Nothing beats Exllama for batch 1 quantized inference. Simple use case: single image introduced at the start of the conversation, with text Q&A on the image after that. We'd need to cache the initial cross-attention states (and throw away the image), plus any caching for intermediate x-attn if we choose, and those need to live as long as that conversation lives. Each generated text token appends to the cache. Would this logic be too different from what's already done? Any change to the initial image, or adding another image, I think will destroy all KV cache, same as changing, e.g., the system prompt. I haven't looked at exactly how the hidden states are modified & if they are modified causally in that case, so adding an image might destroy all paged attention and caching (which is fine for a simple use case). There are no image tokens passed through the text layers in 3.2. They are all encoded at the initial layer and only get used in the x-attn layers. Text tokens refer to the images in-context, presumably the x-attn knows to pair each image in order with the corresponding From what I'm hearing, it should be possible to create a wrapper script that does this with Exllama right now, provided I give up KV caching and manually go layer-by-layer in Python to intercept. Will give it a shot at least.
I was actually proposing leaving all x-attn in fp16, and technically we can process it outside the Exllama framework.
I made one: https://huggingface.co/grimulkan/Llama-3.2-90B-Vision-Hermes-3-lorablated-merge, PM if you need other combos. But all fp16.
In most implementations, the images get converted to base64. For sure in the API. Didn't check on the KV cache, but nothing seems to have problems; it's just like any other tokens. The free endless Gemini has kept me from trying to fire the really large ones up still. I totally look forward to magnum-qwenvl and trying Hermes with vision layers. Even if AWQ takes a bite, it can't be that bad. Already at least 2 models came out with a GAN for generating too, and then this will get even more complex.
FWIW I still think it is worth keeping 3.2 support separate from general multi-modal support in Exllama. Maybe it motivates better support due to less work & variability. No image tokens flowing through Exllama, text output.
Thing is they're not actually tokens, they're embeddings produced by another model. So they don't exist in the model's vocabulary, and any mechanism in the generator that relies on the one-to-one mapping between token IDs and their embeddings won't work. There's all sorts of complications from that.
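To make the mapping problem concrete, here is a minimal sketch of one possible workaround: reserving IDs above the vocabulary for externally produced vectors. The `EmbeddingTable` class and its methods are hypothetical names invented for illustration and are not part of ExLlamaV2; plain Python lists stand in for tensors.

```python
from typing import Dict, List, Sequence

Vector = List[float]


class EmbeddingTable:
    """Toy embedding layer: vocabulary rows plus externally supplied vectors.

    Hypothetical illustration only, not the ExLlamaV2 implementation.
    """

    def __init__(self, vocab: Sequence[Vector]):
        self.vocab = [list(v) for v in vocab]
        # Assumption: IDs at or above len(vocab) are reserved for external embeddings.
        self.external: Dict[int, Vector] = {}
        self.next_free_id = len(self.vocab)

    def register(self, vectors: Sequence[Vector]) -> range:
        """Reserve one fresh ID per external vector and return the ID range."""
        ids = range(self.next_free_id, self.next_free_id + len(vectors))
        for token_id, vec in zip(ids, vectors):
            self.external[token_id] = list(vec)
        self.next_free_id = ids.stop
        return ids

    def release(self, ids: range) -> None:
        """Drop external vectors once nothing refers to them any more."""
        for token_id in ids:
            self.external.pop(token_id, None)

    def lookup(self, token_ids: Sequence[int]) -> List[Vector]:
        """Map IDs to vectors; raises KeyError for a released external ID."""
        out = []
        for token_id in token_ids:
            if token_id < len(self.vocab):
                out.append(self.vocab[token_id])
            else:
                out.append(self.external[token_id])
        return out

    def is_sampleable(self, token_id: int) -> bool:
        """Only real vocabulary entries can be produced by the sampler."""
        return 0 <= token_id < len(self.vocab)
```

The sketch shows where the complications come from: reserved IDs have no text form, cannot be sampled, and are only valid while their vectors are registered.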
Made some progress on the hybrid Exllama+Transformers forward pass, hopefully will post it soon. I realized even though you can reference multiple images in-line in Llama 3.2, it doesn't actually work. The only thing that works is to have all the images up-front, and reference them all in the first prompt. Multiple images do work with that usage (and the model refers to them as 'left' image vs 'right' image if there are 2, etc.). Officially, it seems none of this is supported. That means we don't have to worry about invalidating the KV cache for now. Once the image embeddings are computed they do not change until we basically destroy everything and start again. Only their cross-attentions for the next text token are computed, and the resulting information is appended to the KV cache in the self-attention layer (normal text behavior). So you don't really add new images to an existing conversation.
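The cache lifetime described above can be sketched as two tiers: cross-attention states that are set once per image, and self-attention entries that grow with every text token. This is a minimal sketch under that assumption; `ConversationCache` and its fields are hypothetical names, and opaque Python objects stand in for key/value tensors.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class ConversationCache:
    """Hypothetical per-conversation cache for an x-attn vision model."""

    # Cross-attention states keyed by layer index: computed once, never appended to.
    cross_kv: Dict[int, Any] = field(default_factory=dict)
    # Self-attention entries: one per processed text token.
    self_kv: List[Any] = field(default_factory=list)
    image_id: Optional[str] = None

    def set_image(self, image_id: str, cross_kv: Dict[int, Any]) -> None:
        """Install image states; a different image invalidates the text cache."""
        if image_id != self.image_id:
            # Assumption from the thread: changing or adding an image means
            # starting over, just like changing the system prompt.
            self.self_kv.clear()
        self.image_id = image_id
        self.cross_kv = dict(cross_kv)

    def append_token(self, kv_entry: Any) -> None:
        """Normal text behavior: each new token appends to the self-attention cache."""
        self.self_kv.append(kv_entry)
```

Because the cross-attention tier is read-only after `set_image`, the original image can be discarded as soon as its states are stored.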
Their documentation mentions that all images must be put up front in the message, which I find a bit weird. I wonder whether it will work at all in a multi-turn conversation, and how it can track which question refers to which image. My testing so far with Qwen2-VL shows that it is much better in this regard. Nonetheless, I think this should be handled at the user/front-end level. If a model requires the user to put the image before the text, then the user should do so in their chat, unless the code is hardcoded for just Llama 3.2.
Supposedly, you stick the Upon further testing, looks like the model just hallucinated before when it differentiated between the images for me. 2 images do NOT seem to work, in line with the official response in my previous post. I don't know why the HF processing functions accept it, and go through the hassle of counting images to match The intended use case seems to be: single image -> multi-turn conversation about it. Yes, pretty much all the other vision models are more flexible about how they handle image(s), but Llama has the advantage of leaving the text layers untouched. Hopefully they will develop this further. I thought that would make Exllama integration easier, but turbo implies that's not the hard part.
I pushed a bit more work on multimodal support. Example script in the dev branch here. It still needs some testing and consideration, but the basic idea at least seems to be workable for image feature tokens in Llava. It should work for Qwen2-VL as well, although that will require some updates to the RoPE since they have multidimensional positional embeddings for images. Even a time dimension for video, just to make it that much harder. :P But the API is sorting itself out, I think. Basically once you have your embeddings (still produced separately by a Transformers vision tower), it's:

```python
# Wrap the vision tower output and give it a text alias for use in prompts
image_tokens = ExLlamaV2MMEmbedding(
    model = model,
    embeddings = image_features,
    text_alias = "{{EMBED_HERE}}"
)
```

And then the lifetime of the embeddings is managed that way. The tokenizer will respect the provided alias and insert a range of special tokens wherever it sees the string "{{EMBED_HERE}}", and the embedding layer of the LLM itself can respect those special tokens if the corresponding

```python
# The alias marks where the image embeddings go in the prompt
prompt = "[INST] {{EMBED_HERE}} Describe this image. [/INST]"
output = generator.generate(
    prompt = prompt,
    max_new_tokens = 200,
    add_bos = True,
    stop_conditions = [tokenizer.eos_token_id],
    embeddings = [image_tokens],
)
```

This also solves (continuous) batching, deduplication and some other issues, since the special tokens are unique to each image (or audio clip or video or whatever else), and you can use as many as you want per individual job. The lifetime of the embeddings is the lifetime of the generator job, unless you want to reuse the same embeddings for other jobs (e.g. a running context in a chat), in which case you would either explicitly keep a reference or use some sort of cache (probably best for an API server). So I think this is very doable overall. It won't work for Llama 3.2, though. And I really don't know if it's even worth it since Qwen2-VL is already very good and so much more approachable than x-attn.
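The alias mechanism can be illustrated with a toy whitespace tokenizer. This is only a sketch of the idea, not how ExLlamaV2 implements it: `AliasAllocator`, `encode_with_aliases`, the starting ID of 1000 and the tiny vocabulary are all assumptions made up for the example.

```python
from typing import Dict, List


class AliasAllocator:
    """Hands out disjoint ID ranges so every embedding gets unique special tokens."""

    def __init__(self, first_id: int):
        self.next_id = first_id

    def allocate(self, length: int) -> range:
        ids = range(self.next_id, self.next_id + length)
        self.next_id = ids.stop
        return ids


def encode_with_aliases(
    prompt: str, vocab: Dict[str, int], aliases: Dict[str, range]
) -> List[int]:
    """Whitespace 'tokenizer' that expands each alias into its reserved ID range."""
    ids: List[int] = []
    for piece in prompt.split():
        if piece in aliases:
            # One special token per embedding position
            ids.extend(aliases[piece])
        else:
            ids.append(vocab[piece])
    return ids


# Usage: two images receive non-overlapping ranges, so jobs that reference
# different images can never collide, and identical ranges identify reuse.
allocator = AliasAllocator(first_id=1000)
aliases = {
    "{{IMAGE_A}}": allocator.allocate(4),
    "{{IMAGE_B}}": allocator.allocate(4),
}
```

Since each range is unique to one embedding, comparing token IDs is enough to decide whether two prompts share a prefix, which is what makes batching and deduplication fall out of the design.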
A merged turbocat-vision will probably mog llama-3.2. They are supposed to release other multimodal models in the future which may not be so limited to one image at a time.
Qwen2VL is a very strong model, I think this is a great compromise.
Just wanted to mention: I got far enough to convince myself that you could run Llama 3.2 with the text portions quantized in Exllama (even with default text-only data for quantization), with the xattn layers unquantized. I haven't been motivated to clean up my code and post it though, or frankly even use it regularly, because there are much better options than Llama 3.2 for vision right now. If someone really cares, I guess post here for motivation! The only advantage I can see is that you can use pre-quantized text layer weights out of the box, but not sure that compensates for the lack of features and capability in the vision domain.
I'm thinking if Llama-3.2V-11B-cot will make this worth it. But given QwenVL has a new QVQ-72B hmm... |
Problem
Wondering if basic support already exists.
Llama vision 3.2 is unlike #399, and in some ways may be very easy for basic Exllama integration (i.e., skipping quantization for the vision part and only quantizing & processing the language part with Exllama).
I do plan to experiment, but would welcome any tips/thoughts from turboderp or any others who have tried (and maybe failed).
Solution
The flow:
[3, 8, 13, 18, 23, 28, 33, 38, 43, 48, 53, 58, 63, 68, 73, 78, 83, 88, 93, 98]
(for 90B). For those layers, we need to pass the hidden layer outputs from the previous Exllama layer into the fp16 cross-attention layer, together with the cross-attention states from earlier, then pass the hidden state output back to Exllama for the next text layer. All of this is done outside Exllama (using, e.g., HF transformers SDPA attention code). It intercepts & modifies the hidden states. That's it; the final output is the same since it is still text output.
I think this can all be done in python from the way Exllama is structured, without modifying any Exllama engine code or changing inputs/outputs.
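The interception loop can be sketched in plain Python. This is a minimal sketch that assumes the listed indices are positions in the combined layer stack; `hybrid_forward`, `text_layer` and `cross_layer` are hypothetical placeholders for the quantized Exllama modules and the fp16 HF cross-attention layers, not real APIs of either library.

```python
from typing import Any, Callable, Iterable

# Cross-attention positions listed above for the 90B model: 3, 8, ..., 98.
CROSS_ATTN_LAYERS = list(range(3, 100, 5))


def hybrid_forward(
    hidden: Any,
    num_layers: int,
    text_layer: Callable[[int, Any], Any],
    cross_layer: Callable[[int, Any, Any], Any],
    cross_states: Any,
    cross_indices: Iterable[int] = CROSS_ATTN_LAYERS,
) -> Any:
    """Walk the layer stack, diverting to fp16 cross-attention at the given indices.

    text_layer(i, hidden) stands in for the i-th quantized text layer;
    cross_layer(idx, hidden, cross_states) stands in for the fp16 layer at
    combined-stack position idx. Both are placeholders supplied by the caller.
    """
    cross = set(cross_indices)
    text_idx = 0  # text layers are numbered separately from the combined stack
    for idx in range(num_layers):
        if idx in cross:
            # Leave the quantized path: hidden state plus cached image states
            hidden = cross_layer(idx, hidden, cross_states)
        else:
            hidden = text_layer(text_idx, hidden)
            text_idx += 1
    return hidden
```

Keeping a separate `text_idx` reflects the assumption that the quantized model only contains the text layers, so its own layer numbering skips the cross-attention positions.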
EXL2 quantization for the text portion of the model could be done the usual way with the default dataset to start, otherwise need to test it with multi-modal tokens (which could be integrated using the same flow as above in the quantizer). I'm guessing for >4bit quantization it won't be needed.
I'm also thinking all the KV caching tricks should still basically work? We'd need to do our own caching for the cross-attention layers in-between of course (or just, not).
Alternatives
No response
Explanation
Would be nice to be able to run vision models with fewer GPUs, since the language part is the bulk of the weights and is unchanged.
Examples
No response
Additional context
No response
Acknowledgements