Video-LLaVa now available in the Transformers library! #156
It's a great feat. Thank you for your generous help!
@zucchini-nlp I'm seeing the following problem: File "/home/rhelck/videotest.py", line 3, in ... The older example works fine for me, though. I reinstalled transformers in a new venv for this, by the way.
@rhelck hey! Did you install transformers from main, as shown in the issue description? The model is only available when installing from source for now.
@zucchini-nlp I want to distribute the model on multiple GPUs, but loading fails with a ValueError (raise ValueError(...)).
@darshana1406 could you open this as an issue in the transformers repository? Also, you are welcome to open a PR if you are willing to; we are always happy for community contributions 🤗
@zucchini-nlp That worked perfectly, thanks!
Can it also be used with images as before, or only for videos?
@IsabelJimenez99 , yes, the model can be used with images / videos / a mix of image and video. Check out a Colab notebook for inference examples with different input modalities.
Ah, ok. Sorry, I hadn't seen the Colab. Thank you very much and excellent work. Congratulations!
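As a quick illustration, here is a minimal image-only sketch (the image path is a placeholder and this is not the exact Colab code):

```python
import torch
from PIL import Image
from transformers import VideoLlavaProcessor, VideoLlavaForConditionalGeneration

model_id = "LanguageBind/Video-LLaVA-7B-hf"
processor = VideoLlavaProcessor.from_pretrained(model_id)
model = VideoLlavaForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")

# Image-only inference: use the <image> placeholder token in the prompt.
image = Image.open("my_image.jpg")  # placeholder path
prompt = "<image>USER: What do you see in the image? ASSISTANT:"
inputs = processor(text=prompt, images=image, return_tensors="pt").to("cuda")

out = model.generate(**inputs, max_new_tokens=80)
print(processor.batch_decode(out, skip_special_tokens=True)[0])
```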
Can we use this library for fine-tuning as well, or only for inference? If we can, is there documentation on how to use it properly?
@BalloutAI Yes, we can. I am preparing a tutorial notebook for fine-tuning and will add it here when it's done.
Thank you so much! Any expected timeline for that?
@BalloutAI I made a short notebook for finetuning on a small dataset, you can find it here.
I am testing with the model 'LanguageBind/Video-LLaVA-7B-hf' and every time I run it on an image I get a different answer. I would also like to know how much confidence the model has in each response; is that possible?
@IsabelJimenez99 You mean the model gives a different generation every time, even if you keep the same image and prompt? That shouldn't be the case; can you share a minimal reproducible example? Regarding the model's confidence in each response, have a look at this thread, which shows how to get the probability of each generated token :)
Yes, it's the same image and the same prompt, but different answers. The code I used is the same as the one shown in your Colab. This is the code:
On the other hand, I searched for what is happening to me and they propose the following: ... However, when I extrapolate that to your code I get the following error: AttributeError: 'Tensor' object has no attribute 'sequences'
@IsabelJimenez99 Ah, I see now. Getting different outputs each time is expected in this case, because you have set do_sample=True: a token is sampled from the distribution at each step instead of being picked greedily. And for the second issue, you need to set "return_dict_in_generate=True, output_scores=True" in the generate kwargs to get scores in the output; otherwise we only return the generated text. For more details on which arguments you can pass in kwargs and what they mean, see the docs 🤗
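As a hedged sketch of that suggestion (reusing model, processor, and inputs from the image example above):

```python
# Deterministic decoding plus per-token scores.
outputs = model.generate(
    **inputs,
    max_new_tokens=60,
    do_sample=False,                 # greedy decoding -> same answer on every run
    return_dict_in_generate=True,    # return a generation output object, not a plain tensor
    output_scores=True,              # keep the logits for each generated step
)

# Turn the raw step scores into log-probabilities of the generated tokens.
transition_scores = model.compute_transition_scores(
    outputs.sequences, outputs.scores, normalize_logits=True
)

# Tokens generated after the prompt, paired with their probabilities.
gen_tokens = outputs.sequences[:, inputs["input_ids"].shape[1]:]
for tok, score in zip(gen_tokens[0], transition_scores[0]):
    print(processor.tokenizer.decode(tok), float(score.exp()))
```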
Oh! I understand now, thank you very much! And sorry for the inconvenience.
@zucchini-nlp does the model support batched inference, i.e. passing several prompts at once?
@orrzohar yes, the model supports batching. For that you just have to pass the prompts as a list of strings, and also a list of visuals. You can also batch different visual inputs: for example, one prompt has only an image and another has only a video:
prompts = ["<video>USER: What do you see in the video? ASSISTANT:", "<image>USER: What do you see in the image? ASSISTANT:", "<video>USER: more video instructions..."]
inputs = processor(text=prompts, images=image, videos=[clip, clip_2], padding=True, return_tensors="pt")
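To round that snippet out, a minimal sketch of running the batched generation (reusing the model and processor loaded in the image example above; image, clip and clip_2 are assumed to be a PIL image and two 8-frame arrays):

```python
# Generation for the mixed batch built above; left padding is the safer
# choice for batched generation with decoder-only models.
processor.tokenizer.padding_side = "left"
inputs = processor(
    text=prompts,
    images=image,
    videos=[clip, clip_2],
    padding=True,
    return_tensors="pt",
).to("cuda")

out = model.generate(**inputs, max_new_tokens=80)
for answer in processor.batch_decode(out, skip_special_tokens=True):
    print(answer)
```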
How might one most efficiently batch multiple prompts with a single clip/video, e.g. to achieve batched prompts applied to one video? Passing in ... Btw, in case it helps anyone reading: I had to add padding & truncation args.
@n2nco in that case you have to pass the clip multiple times, as you have two separate prompts, each with a special "video" token. Transformers cannot align one video with several prompts, as we don't know for sure whether that was intentional or a mistake in the code, so the safe way is to pass in as many clips as there are special "video" tokens (see the short sketch below) :)
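A short sketch of that, continuing from the batching example above (the same clip is simply repeated once per prompt):

```python
# Two different prompts over one and the same clip.
prompts = [
    "<video>USER: Describe the video. ASSISTANT:",
    "<video>USER: Is anyone speaking in the video? ASSISTANT:",
]
inputs = processor(text=prompts, videos=[clip, clip], padding=True, return_tensors="pt").to("cuda")
out = model.generate(**inputs, max_new_tokens=80)
print(processor.batch_decode(out, skip_special_tokens=True))
```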
Just a side note: could you move the fine-tuning notebook to the main page Markdown? It'll be much easier to spot. Much appreciated!
@WeizhenWang-1210 hey! We don't usually add these notebooks to the Transformers docs, but you can find this one and many more in our tutorials repo 🤗
Hey, thanks for the awesome work. I adapted the fine-tuning setup to my own True/False video question dataset, but validation accuracy is 100% no matter what the question is, so I suspect the answers are leaking into the model input somewhere. This is my setup (function and class bodies omitted):

def collate_read_video(example, path):
    ...

def load_videos_from_directory(directory):
    ...

data = load_videos_from_directory("/mypath")
dataset = dataset.map(collate_read_video, batched=False, fn_kwargs={"path": ""}, writer_batch_size=100)

processor = AutoProcessor.from_pretrained(MODEL_ID)

class VideoLlavaDataset(Dataset):
    ...

def train_collate_fn(examples):
    ...

def eval_collate_fn(examples):
    ...

train_dataset = VideoLlavaDataset(dataset["train"])

class VideoLlavaModelPLModule(L.LightningModule):
    ...
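For reference, a hedged sketch of what a train collator along these lines might look like; this is not the poster's actual code, and the dataset fields ("prompt", "answer", "clip") and MAX_LENGTH are assumptions:

```python
def train_collate_fn(examples):
    # Training: the full text (prompt + ground-truth answer) is fed to the model,
    # and padding tokens are masked out of the loss.
    texts = [ex["prompt"] + " " + ex["answer"] for ex in examples]   # assumed dataset fields
    videos = [ex["clip"] for ex in examples]                         # assumed 8-frame arrays
    batch = processor(
        text=texts,
        videos=videos,
        padding=True,
        truncation=True,
        max_length=MAX_LENGTH,
        return_tensors="pt",
    )
    labels = batch["input_ids"].clone()
    labels[labels == processor.tokenizer.pad_token_id] = -100  # ignore padding in the loss
    batch["labels"] = labels
    return batch
```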
Think it'd be straightforward to swap the Vicuna-7B base for a Llama-3-8B one? e.g. https://huggingface.co/lmms-lab/llama3-llava-next-8b
@BalloutAI , I am not sure where the "question" you're referring to is in the prompt, and it's weird that the model is getting 100%. Did you try verifying that the validation dataloader is correct (shapes and content) and turning on verbose mode to print the predictions/answers? @n2nco yes, swapping the backbone LLM should be easy by tweaking the model's config, but the new model would require training. AFAIK the LLaVA-NeXT you're pointing to can handle video inputs even though it wasn't trained for that. We're working on adding those in Transformers 😄
Yeah, I have tried printing, and it is getting them correctly: ['USER: \nAnswer the following question based on the video by True or False. ASSISTANT: Answer: True']. And it is answering them correctly no matter what the question is, for some reason. My guess was that I am feeding the answers to the model directly somehow, but I can't find the problem, because I am getting my answer from the decoded_predictions.
@BalloutAI Ah, sorry, you're right! I didn't see you had a different collate_fn. In the eval_collate_fn, when you feed the text to the tokenizer, you have to get rid of the answer first:
texts = [text.split("Answer: ")[0] for text in texts]  # keep only the text without the ground-truth answer
batch = processor(text=texts, videos=videos, padding=True, truncation=True, max_length=MAX_LENGTH, return_tensors="pt")
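Putting that together, a hedged sketch of an eval collator built this way (again, not the poster's actual code; field names and MAX_LENGTH are assumptions):

```python
def eval_collate_fn(examples):
    # Evaluation: strip the ground-truth answer so the model has to generate it,
    # and return the answers separately for scoring the decoded predictions.
    texts = [ex["prompt"].split("Answer: ")[0] for ex in examples]   # prompt without the answer
    answers = [ex["answer"] for ex in examples]                      # kept for comparison only
    videos = [ex["clip"] for ex in examples]
    batch = processor(
        text=texts,
        videos=videos,
        padding=True,
        truncation=True,
        max_length=MAX_LENGTH,
        return_tensors="pt",
    )
    return batch, answers
```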
Awesome, thx! I expected that!
Thanks for your contribution. But I came across a bug: ValueError: Video pixel values should have exactly 8 frames ...
@caichuang0415 hey! Yes, since VideoLlava was trained with 8 frames, we currently support only 8-frame videos. You can open a PR if you want to give it a try; otherwise I'll take a look at it next week :)
@caichuang0415 Video-LLaVa can now work with any number of frames at input. But note that inference with more than 8 frames degrades quality, as the model wasn't trained in that setting; I recommend fine-tuning with 24 frames first if you want good performance at that length. To get the updated version, please update transformers with: !pip install --upgrade git+https://github.com/huggingface/transformers.git
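For example, a small sketch of sampling 16 frames after updating (the video path is a placeholder; model and processor are loaded as in the earlier examples):

```python
import av
import numpy as np

def sample_frames(path, num_frames=16):
    # Uniformly sample `num_frames` RGB frames from the whole video.
    container = av.open(path)
    total = container.streams.video[0].frames
    indices = set(np.linspace(0, total - 1, num_frames).astype(int))
    frames = [f.to_ndarray(format="rgb24")
              for i, f in enumerate(container.decode(video=0)) if i in indices]
    return np.stack(frames)

clip = sample_frames("my_video.mp4", num_frames=16)  # placeholder path
prompt = "<video>USER: Describe the video in detail. ASSISTANT:"
inputs = processor(text=prompt, videos=clip, return_tensors="pt").to("cuda")
out = model.generate(**inputs, max_new_tokens=100)
print(processor.batch_decode(out, skip_special_tokens=True)[0])
```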
Thanks for the update! I will take your advice and run more experiments.
Hey!
Video-LLaVa is now available in the Transformers library! Feel free to check it out here. Thanks to @LinB203 for helping to ship the model 🤗
To get the model, update transformers by running:
!pip install --upgrade git+https://github.com/huggingface/transformers.git
Inference with videos can be done as in the sketch below. Check out the docs and the Colab notebook mentioned above for more examples.
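A hedged sketch of basic video inference (the original snippet from the issue body isn't preserved in this page; the prompt format follows the comments above, and the video path is a placeholder):

```python
import av
import numpy as np
import torch
from transformers import VideoLlavaProcessor, VideoLlavaForConditionalGeneration

model_id = "LanguageBind/Video-LLaVA-7B-hf"
processor = VideoLlavaProcessor.from_pretrained(model_id)
model = VideoLlavaForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")

def read_video_pyav(path, num_frames=8):
    # Decode the video with PyAV and uniformly sample `num_frames` RGB frames.
    container = av.open(path)
    total = container.streams.video[0].frames
    indices = set(np.linspace(0, total - 1, num_frames).astype(int))
    frames = [f.to_ndarray(format="rgb24")
              for i, f in enumerate(container.decode(video=0)) if i in indices]
    return np.stack(frames)

clip = read_video_pyav("my_video.mp4")  # placeholder path, 8 sampled frames
prompt = "<video>USER: What do you see in the video? ASSISTANT:"

inputs = processor(text=prompt, videos=clip, return_tensors="pt").to("cuda")
out = model.generate(**inputs, max_new_tokens=100)
print(processor.batch_decode(out, skip_special_tokens=True)[0])
```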