Multimodal support with Phi 3 Vision + Transformers #1020

Open

wants to merge 318 commits into main

Conversation

Contributor

@nking-1 nking-1 commented Sep 11, 2024

Adds a general framework for supporting multimodal models in Guidance, as well as an implementation of Phi 3 Vision, using the transformers library.

  • Refactored some code in _parser.py. Changed the __init__ function for TokenParser so that it contains less logic. This was necessary to implement more flexible processing of the prompt to get the token IDs when dealing with various data formats. Now engines can do more specific preparation of the TokenParser if they need to.
  • Added Phi 3 vision chat template
  • Model and engine refactor
    • Created a Modality enum which indicates the modality of data in a prompt segment. Some types have been created to indicate that the data points to a URL; this might enable us to support APIs like Gemini in the future, which require users to supply a Google Cloud URL for the blob data in the API request.
    • The model state is still stored as a string. Multimodal data is stored in the model object's key-value store and referenced by its ID. It is encoded in the prompt as <|_{modality.name}:{str(id(data))}|>, essentially a placeholder for where the larger blob data should be inserted later on when prompting the model (see the sketch after this list).
    • Image, audio, and video bytes are appended to the model with functions like append_image(), append_audio_bytes(), etc. These functions are meant to be used by user-facing guidance library functions such as image() so those guidance functions can load the data into the model state.
    • When the model calls the engine, the multimodal blob data must be passed to the engine, which might live in a separate process or server somewhere. To allow this, a new media dict parameter was added to __call__(), get_next_token(), and get_logits() in Engine. There is a default implementation of these provided in the Engine base class. Subclasses of Engine can override these functions as necessary to parse the prompt string and pack the media data as needed depending on the API.
    • The prompt string parameter sent to engine will still contain the placeholders formatted like <|_{modality.name}:{str(id(data))}|>. Engines will parse this string and extract the ID part using a regex. This ID is used to map to the actual blob data in the media dict: {id: blob_data}
    • get_next_token() and get_logits() also receive the media dict parameter because sometimes engines will need the media data, along with the prompt string, at that particular point in time to prepare an API request. The idea is to ensure there's enough flexibility to handle various kinds of models.
  • A new hack was included in Transformers Tokenizer creation to accommodate Phi 3 Vision's tokenizer, which uses the sentencepiece convention for encoding spaces but is not an sp_model itself
  • A specific class was written for TransformersPhi3VisionEngine. I found it difficult to subclass the existing Transformers classes to add the needed new code, so it's best to consider this new TransformersPhi3VisionEngine as a prototype for what a multimodal Transformers engine looks like. For now, there were some Phi 3 behaviors I had to account for. In the future, as we support more multimodal Transformers models, we may notice specific patterns arising that we can use to make the implementation cleaner and more general.
    • Multimodal Transformers models use an AutoProcessor instead of AutoTokenizer to prepare model inputs. The model inputs for Phi 3 include token ids, image pixel values, and an attention mask
    • Phi 3 Vision uses a convention of negative token IDs to pad the tokens with the space needed to fit the image embeddings in the model input. The negative IDs correspond to the image index: with 3 image inputs you would see tokens -1, -2, and -3, for example. Note that Phi 3 Vision is only trained on 1 image input, though. On Hugging Face, they have stated their stance is to allow people to fine-tune for multi-image use cases if they want: https://huggingface.co/microsoft/Phi-3-vision-128k-instruct/discussions/60
    • The input tokens might look something like this for a prompt like "Hello <image_1> what is this <image_2>?": [30, 50, 22, -1, -1, -1, -1, -1, -1, ..., -1, 54, 893, 250, -2, -2, -2, -2, -2, ..., -2, 542]
  • LL Guidance has a process_prompt() function which does the initial token healing and grammar forcing for a prompt string. For multimodal prompts, there are boundaries between text data and multimodal data in the token space, and token healing or forcing cannot be applied across those boundaries. So we preserve the existing tokens from the initial tokenization and only send the text tokens at the end of the prompt, after all multimodal data, to process_prompt(). If the prompt ends with a multimodal input, we do not use process_prompt() at all (see the sketch after this list). We might improve on this later on, but it seems to work for now.
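
A minimal sketch of the placeholder and boundary conventions described above. The Modality enum and the placeholder layout follow this PR's description, but the helper functions (make_placeholder, extract_media, trailing_text) are illustrative names, not the PR's actual API:

```python
import re
from enum import Enum


class Modality(Enum):
    IMAGE = 1
    AUDIO = 2
    VIDEO = 3


# Placeholders embedded in the prompt string look like <|_IMAGE:140234721|>.
PLACEHOLDER_PATTERN = re.compile(r"<\|_([A-Z]+):(\d+)\|>")


def make_placeholder(data: bytes, modality: Modality, media: dict) -> str:
    """Store the blob in the media dict and return its prompt placeholder."""
    media_id = str(id(data))
    media[media_id] = data
    return f"<|_{modality.name}:{media_id}|>"


def extract_media(prompt: str, media: dict) -> list:
    """Map each placeholder in the prompt back to its blob in the media dict."""
    found = []
    for match in PLACEHOLDER_PATTERN.finditer(prompt):
        modality = Modality[match.group(1)]  # e.g. "IMAGE" -> Modality.IMAGE
        found.append((modality, media[match.group(2)]))
    return found


def trailing_text(prompt: str) -> str:
    """Return only the text after the last placeholder.

    Token healing / grammar forcing (process_prompt) would apply to this
    trailing segment only; if it is empty, process_prompt is skipped.
    """
    matches = list(PLACEHOLDER_PATTERN.finditer(prompt))
    return prompt if not matches else prompt[matches[-1].end():]
```

In this sketch, an engine's __call__ would receive the prompt string (with placeholders) together with the media dict, and use extract_media to rebuild the ordered blobs before preparing its model or API inputs.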

TODO:

  • Fix and improve tests

@@ -72,7 +88,7 @@ def get_chat_template(self): # TODO [HN]: Add more logic here...should we instan
def reset_metrics(self):
    self.metrics = GuidanceEngineMetrics()

-def start(self, prompt, grammar, ensure_bos_token=True) -> TokenParser:
+def start(self, prompt, grammar, media: dict, ensure_bos_token=True) -> TokenParser:
Collaborator

Should we make "media" optional?
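
If it were made optional, a signature-level sketch might look like this (the return annotation is omitted to keep the snippet self-contained; the default-handling line is illustrative):

```python
from typing import Optional

def start(self, prompt, grammar, media: Optional[dict] = None, ensure_bos_token=True):
    media = media if media is not None else {}
    # ... existing parser setup continues unchanged ...
```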

Comment on lines 406 to 407
"""The current prompt in bytes (which is the state without the context close tags)."""
return format_pattern.sub("", self._state)
Collaborator

@Harsha-Nori Harsha-Nori Sep 12, 2024

@slundberg do you mind if I get rid of this?

Comment on lines 659 to 677
def append_image(self, image):
    """
    Appends image bytes to the model's state.
    """
    return self._append_multimodal(image, Modality.IMAGE)


def append_audio_bytes(self, audio):
    """
    Appends audio bytes to the model's state.
    """
    return self._append_multimodal(audio, Modality.AUDIO)


def append_video_bytes(self, video):
    """
    Appends video bytes to the model's state.
    """
    return self._append_multimodal(video, Modality.VIDEO)
Collaborator

Maybe we can just collapse these into the OG _append_multimodal?

if isinstance(prompt, bytes):
    prompt = prompt.decode("utf-8")
elif isinstance(prompt, str):
    prompt = prompt
Collaborator

Maybe missing a str -> byte conversion, or skip conditional altogether?

Collaborator

@hudson-ai hudson-ai Sep 12, 2024

Agreed. Side note, but I want us to be a bit more careful about strings vs bytes in general -- maybe some annotations would help
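
A small sketch of what explicit annotations could look like here; normalize_prompt is a hypothetical helper, not existing code:

```python
from typing import Union


def normalize_prompt(prompt: Union[str, bytes]) -> str:
    """Make the str/bytes boundary explicit: always hand back a str."""
    if isinstance(prompt, bytes):
        return prompt.decode("utf-8")
    if isinstance(prompt, str):
        return prompt
    raise TypeError(f"prompt must be str or bytes, got {type(prompt).__name__}")
```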

Comment on lines +90 to +94
model_inputs = self.processor(
    text=processed_prompt,
    images=images if len(images) > 0 else None,
    return_tensors="pt",
).to(self.device)
Collaborator

Let's make sure that this gets called if images are added again later, after the model starts generating. Otherwise, we should trigger a "re-process" upon the next user image being inserted, right?

Collaborator

Could be good to add some tests using unittest.patch if there are any assertions around how many times this should be called for a given image, what should be reused, etc.

Contributor Author

I believe we'll have to re-process when new images are added. I'll look into how to tell if re-processing is necessary. For the moment, I actually haven't seen a Transformers model yet that was trained to support multiple images during inference. As far as I can tell, Phi 3 vision and llama 3.2 were trained for 1 image per context. That being said, we should still support multi-image scenarios here.

return self._cached_logits


class TransformersPhi3Vision(Model):
Collaborator

hope we can get rid of this 🥺

@@ -37,6 +37,7 @@
"openai": ["openai>=1.0"],
"schemas": ["jsonschema"],
"server": ["fastapi-slim", "uvicorn"],
"image": ["pillow"]
Collaborator

☁️

):
    # add the beginning of sequence token if needed
    processed_tokens = [bos_token_id] + processed_tokens

Collaborator

These tokens will likely need to be recoded before being sent to the LLM for logits (I think only in the case that we just added a BOS token).

You could probably just throw that line in create_token_parser after calling process_prompt..? Just since you'll have access to a tokenizer there.

See tests/model_integration/test_model.py::test_associativity

Comment on lines 139 to 142
if ensure_bos_token and tokenizer.bos_token_id is not None:
    bos_token_id = tokenizer.bos_token_id
else:
    bos_token_id = None
Collaborator

Tiny nitpick -- you don't need to check the bos_token_id against None in the if statement if you're assigning to None in the else clause

Contributor Author

Do you think we should throw an error in the case where ensure_bos_token is True and tokenizer.bos_token_id is None?

Collaborator

Mmm good question. To my knowledge, we're never calling this with ensure_bos_token set to False... In other words, I'm not sure the kwarg is really providing any value. I don't think an exception is really necessary in this case. What do you think?

Contributor Author

I'm just going to log a warning in that case for now
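
Roughly along these lines; resolve_bos_token_id is an illustrative name, not the PR's actual helper:

```python
import logging

logger = logging.getLogger(__name__)


def resolve_bos_token_id(tokenizer, ensure_bos_token: bool):
    """Return the BOS token id to prepend, warning if one was requested but is missing."""
    if not ensure_bos_token:
        return None
    if tokenizer.bos_token_id is None:
        logger.warning(
            "ensure_bos_token=True but the tokenizer has no BOS token; "
            "no BOS token will be prepended."
        )
        return None
    return tokenizer.bos_token_id
```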

return processed_tokens


def process_grammar(grammar: Union[GrammarFunction, str]) -> str:
Collaborator

Can we give this a more informative name? E.g. serialize_grammar

Comment on lines +74 to +79
match_str = match.group(0)
modality_type = match.group(1)
if modality_type != Modality.IMAGE.name:
    logger.debug("Skipping non-image modality: %s", match_str)
    continue
media_id = match.group(2)
Collaborator

Named groups and a call to groupdict may yield something a little more readable and less fragile to changes in the pattern
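
For example, a self-contained sketch of the named-group variant (the pattern mirrors the placeholder format from the PR description; the group names are illustrative):

```python
import re

PLACEHOLDER_PATTERN = re.compile(r"<\|_(?P<modality>[A-Z]+):(?P<media_id>\d+)\|>")

prompt = "Describe this image: <|_IMAGE:140234721|>"
for match in PLACEHOLDER_PATTERN.finditer(prompt):
    groups = match.groupdict()    # {"modality": "IMAGE", "media_id": "140234721"}
    if groups["modality"] != "IMAGE":
        continue                  # skip non-image modalities
    media_id = groups["media_id"] # key into the media dict
```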

Contributor Author

That's an interesting suggestion, I will look into that

Contributor Author

nking-1 commented Oct 1, 2024

Thanks for your reviews, Hudson & Harsha! I am picking this PR back up now and will make revisions based on your feedback. I'm also going to work on integrating Llama 3.2 to come up with a more general solution.
