Multimodal support with Phi 3 Vision + Transformers #1020

Open

wants to merge 318 commits into main

Conversation

Contributor

@nking-1 nking-1 commented Sep 11, 2024

Adds a general framework for supporting multimodal models in Guidance, as well as an implementation of Phi 3 Vision, using the transformers library.

  • Refactored some code in _parser.py. Changed the __init__ function for TokenParser so that it contains less logic. This was necessary to implement more flexible processing of the prompt to get the token IDs when dealing with various data formats. Now engines can do more specific preparation of the TokenParser if they need to.
  • Added Phi 3 vision chat template
  • Model and engine refactor
    • Created a Modality enum which indicates the modality of data in a prompt segment. Some types have been created to indicate that the data points to a URL; this might enable us to support APIs like Gemini in the future, which require users to supply a Google Cloud URL for the blob data in the API request.
    • The model state is still stored as a string. Multimodal data is stored in the model object's key-value store and referenced by its ID. It is encoded in the prompt as <|_{modality.name}:{str(id(data))}|>, essentially a placeholder for where the larger blob data should be inserted later on when prompting the model (see the sketch after this list).
    • Image, audio, and video bytes are appended to the model with functions like append_image(), append_audio_bytes(), etc. These functions are meant to be used by user-facing guidance library functions such as image() so those guidance functions can load the data into the model state.
    • When the model calls the engine, the multimodal blob data must be passed to the engine, which might live in a separate process or server somewhere. To allow this, a new media dict parameter was added to __call__(), get_next_token(), and get_logits() in Engine. There is a default implementation of these provided in the Engine base class. Subclasses of Engine can override these functions as necessary to parse the prompt string and pack the media data as needed depending on the API.
    • The prompt string parameter sent to engine will still contain the placeholders formatted like <|_{modality.name}:{str(id(data))}|>. Engines will parse this string and extract the ID part using a regex. This ID is used to map to the actual blob data in the media dict: {id: blob_data}
    • get_next_token() and get_logits() also receive the media dict parameter because sometimes engines will need the media data, along with the prompt string, at that particular point in time to prepare an API request. The idea is to ensure there's enough flexibility to handle various kinds of models.
  • A new hack was included in Transformers Tokenizer creation to accommodate Phi 3 Vision's tokenizer, which uses the sentencepiece convention for encoding spaces but is not an sp_model itself
  • A specific class was written for TransformersPhi3VisionEngine. I found it difficult to subclass the existing Transformers classes to add the needed new code, so it's best to consider this new TransformersPhi3VisionEngine as a prototype for what a multimodal Transformers engine looks like. For now, there were some Phi 3 behaviors I had to account for. In the future, as we support more multimodal Transformers models, we may notice specific patterns arising that we can use to make the implementation cleaner and more general.
    • Multimodal Transformers models use an AutoProcessor instead of AutoTokenizer to prepare model inputs. The model inputs for Phi 3 include token ids, image pixel values, and an attention mask
    • Phi 3 Vision uses a convention of negative token IDs to pad the tokens with the space needed to fit the image embeddings in the model input. The negative IDs correspond to the image index: with 3 image inputs you would see tokens -1, -2, and -3, for example. Note that Phi 3 Vision is only trained on 1 image input, though. On Hugging Face, they have stated their stance is to allow people to fine-tune for multi-image use cases if they want: https://huggingface.co/microsoft/Phi-3-vision-128k-instruct/discussions/60
    • The input tokens might look something like this for a prompt like "Hello <image_1> what is this <image_2>?": [30, 50, 22, -1, -1, -1, -1, -1, -1, ..., -1, 54, 893, 250, -2, -2, -2, -2, -2, ..., -2, 542]
  • LL Guidance has a process_prompt() function which does the initial token healing and grammar forcing for a prompt string. For multimodal prompts, there are boundaries between text data and multimodal data in the token space, and token healing or forcing cannot be applied across those boundaries. So we preserve the existing tokens from the initial tokenization and only send the text tokens at the end of the prompt, after all multimodal data, to process_prompt(). If the prompt ends with a multimodal input, we do not use process_prompt() at all (see the sketch after this list). We might improve on this later on, but it seems to work for now.
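
A minimal sketch of the placeholder and boundary conventions described above. The Modality enum and the placeholder layout follow this PR's description, but the helper functions (make_placeholder, extract_media, trailing_text) are illustrative names, not the PR's actual API:

```python
import re
from enum import Enum


class Modality(Enum):
    IMAGE = 1
    AUDIO = 2
    VIDEO = 3


# Placeholders embedded in the prompt string look like <|_IMAGE:140234721|>.
PLACEHOLDER_PATTERN = re.compile(r"<\|_([A-Z]+):(\d+)\|>")


def make_placeholder(data: bytes, modality: Modality, media: dict) -> str:
    """Store the blob in the media dict and return its prompt placeholder."""
    media_id = str(id(data))
    media[media_id] = data
    return f"<|_{modality.name}:{media_id}|>"


def extract_media(prompt: str, media: dict) -> list:
    """Map each placeholder in the prompt back to its blob in the media dict."""
    found = []
    for match in PLACEHOLDER_PATTERN.finditer(prompt):
        modality = Modality[match.group(1)]  # e.g. "IMAGE" -> Modality.IMAGE
        found.append((modality, media[match.group(2)]))
    return found


def trailing_text(prompt: str) -> str:
    """Return only the text after the last placeholder.

    Token healing / grammar forcing (process_prompt) would apply to this
    trailing segment only; if it is empty, process_prompt is skipped.
    """
    matches = list(PLACEHOLDER_PATTERN.finditer(prompt))
    return prompt if not matches else prompt[matches[-1].end():]
```

In this sketch, an engine's __call__ would receive the prompt string (with placeholders) together with the media dict, and use extract_media to rebuild the ordered blobs before preparing its model or API inputs.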

TODO:

  • Fix and improve tests

@@ -72,7 +88,7 @@ def get_chat_template(self): # TODO [HN]: Add more logic here...should we instan
def reset_metrics(self):
    self.metrics = GuidanceEngineMetrics()

-def start(self, prompt, grammar, ensure_bos_token=True) -> TokenParser:
+def start(self, prompt, grammar, media: dict, ensure_bos_token=True) -> TokenParser:
Collaborator

Should we make "media" optional?
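
If it were made optional, a signature-level sketch might look like this (the return annotation is omitted to keep the snippet self-contained; the default-handling line is illustrative):

```python
from typing import Optional

def start(self, prompt, grammar, media: Optional[dict] = None, ensure_bos_token=True):
    media = media if media is not None else {}
    # ... existing parser setup continues unchanged ...
```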

Comment on lines 406 to 407
"""The current prompt in bytes (which is the state without the context close tags)."""
return format_pattern.sub("", self._state)
Collaborator

@Harsha-Nori Harsha-Nori Sep 12, 2024

@slundberg do you mind if I get rid of this?

Comment on lines 659 to 677
def append_image(self, image):
    """
    Appends image bytes to the model's state.
    """
    return self._append_multimodal(image, Modality.IMAGE)


def append_audio_bytes(self, audio):
    """
    Appends audio bytes to the model's state.
    """
    return self._append_multimodal(audio, Modality.AUDIO)


def append_video_bytes(self, video):
    """
    Appends video bytes to the model's state.
    """
    return self._append_multimodal(video, Modality.VIDEO)
Collaborator

Maybe we can just collapse these into the OG _append_multimodal?

if isinstance(prompt, bytes):
    prompt = prompt.decode("utf-8")
elif isinstance(prompt, str):
    prompt = prompt
Collaborator

Maybe missing a str -> byte conversion, or skip conditional altogether?

Collaborator

@hudson-ai hudson-ai Sep 12, 2024

Agreed. Side note, but I want us to be a bit more careful about strings vs bytes in general -- maybe some annotations would help
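
A small sketch of what explicit annotations could look like here; normalize_prompt is a hypothetical helper, not existing code:

```python
from typing import Union


def normalize_prompt(prompt: Union[str, bytes]) -> str:
    """Make the str/bytes boundary explicit: always hand back a str."""
    if isinstance(prompt, bytes):
        return prompt.decode("utf-8")
    if isinstance(prompt, str):
        return prompt
    raise TypeError(f"prompt must be str or bytes, got {type(prompt).__name__}")
```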

Comment on lines +90 to +94
model_inputs = self.processor(
    text=processed_prompt,
    images=images if len(images) > 0 else None,
    return_tensors="pt",
).to(self.device)
Collaborator

Let's make sure that this gets called if images are added again later, after the model starts generating. Otherwise, we should trigger a "re-process" upon the next user image being inserted, right?

Collaborator

Could be good to add some tests using unittest.patch if there are any assertions around how many times this should be called for a given image, what should be reused, etc.

Contributor Author

I believe we'll have to re-process when new images are added. I'll look into how to tell if re-processing is necessary. For the moment, I actually haven't seen a Transformers model yet that was trained to support multiple images during inference. As far as I can tell, Phi 3 vision and llama 3.2 were trained for 1 image per context. That being said, we should still support multi-image scenarios here.

return self._cached_logits


class TransformersPhi3Vision(Model):
Collaborator

hope we can get rid of this 🥺

@@ -37,6 +37,7 @@
"openai": ["openai>=1.0"],
"schemas": ["jsonschema"],
"server": ["fastapi-slim", "uvicorn"],
"image": ["pillow"]
Collaborator

☁️

):
    # add the beginning of sequence token if needed
    processed_tokens = [bos_token_id] + processed_tokens

Collaborator

These tokens will likely need to be recoded before being sent to the LLM for logits (I think only in the case that we just added a BOS token).

You could probably just throw that line in create_token_parser after calling process_prompt..? Just since you'll have access to a tokenizer there.

See tests/model_integration/test_model.py::test_associativity

Comment on lines 139 to 142
if ensure_bos_token and tokenizer.bos_token_id is not None:
    bos_token_id = tokenizer.bos_token_id
else:
    bos_token_id = None
Collaborator

Tiny nitpick -- you don't need to check the bos_token_id against None in the if statement if you're assigning to None in the else clause

Contributor Author

Do you think we should throw an error in the case where ensure_bos_token is True and tokenizer.bos_token_id is None?

Collaborator

Mmm good question. To my knowledge, we're never calling this with ensure_bos_token set to False... In other words, I'm not sure the kwarg is really providing any value. I don't think an exception is really necessary in this case. What do you think?

Contributor Author

I'm just going to log a warning in that case for now
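
Roughly along these lines; resolve_bos_token_id is an illustrative name, not the PR's actual helper:

```python
import logging

logger = logging.getLogger(__name__)


def resolve_bos_token_id(tokenizer, ensure_bos_token: bool):
    """Return the BOS token id to prepend, warning if one was requested but is missing."""
    if not ensure_bos_token:
        return None
    if tokenizer.bos_token_id is None:
        logger.warning(
            "ensure_bos_token=True but the tokenizer has no BOS token; "
            "no BOS token will be prepended."
        )
        return None
    return tokenizer.bos_token_id
```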

return processed_tokens


def process_grammar(grammar: Union[GrammarFunction, str]) -> str:
Collaborator

Can we give this a more informative name? E.g. serialize_grammar

Comment on lines +74 to +79
match_str = match.group(0)
modality_type = match.group(1)
if modality_type != Modality.IMAGE.name:
    logger.debug("Skipping non-image modality: %s", match_str)
    continue
media_id = match.group(2)
Collaborator

Named groups and a call to groupdict may yield something a little more readable and less fragile to changes in the pattern
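
For example, a self-contained sketch of the named-group variant (the pattern mirrors the placeholder format from the PR description; the group names are illustrative):

```python
import re

PLACEHOLDER_PATTERN = re.compile(r"<\|_(?P<modality>[A-Z]+):(?P<media_id>\d+)\|>")

prompt = "Describe this image: <|_IMAGE:140234721|>"
for match in PLACEHOLDER_PATTERN.finditer(prompt):
    groups = match.groupdict()    # {"modality": "IMAGE", "media_id": "140234721"}
    if groups["modality"] != "IMAGE":
        continue                  # skip non-image modalities
    media_id = groups["media_id"] # key into the media dict
```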

Contributor Author

That's an interesting suggestion, I will look into that

Contributor Author

nking-1 commented Oct 1, 2024

Thanks for your reviews, Hudson & Harsha! I am picking this PR back up now and will make revisions based on your feedback. I'm also going to work on integrating Llama 3.2 to come up with a more general solution.
