Thanks for your great work! In the Multimodal MobiLlama part of the Results section, you briefly describe how you developed MobiLlama-V. The model appears to have a LLaVA-like architecture but is trained only on the visual instruction tuning data, which may explain why MobiLlama-V exhibits mediocre performance. Hence, my questions are the following:
Can you release more details about the architecture and training process of MobiLlama-V?
Did you (or will you) perform two-stage training, i.e. feature-alignment pre-training followed by visual instruction tuning, instead of only the second stage? (A rough sketch of what I mean is included after these questions.)
Have you considered using ALLaVA-4V, a high-quality multimodal dataset for vision-language training that was proposed specifically to improve the performance of small VLMs?
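
For reference, here is a minimal sketch of the LLaVA-style setup I have in mind, just to make the two-stage question concrete. The module names, dimensions, and the placeholder decoder below are my own illustrative assumptions, not your actual implementation: stage 1 usually trains only the vision-to-language projector on image-caption pairs, and stage 2 unfreezes the LLM for visual instruction tuning.

```python
# Illustrative LLaVA-style wiring, assuming a CLIP-like vision encoder and
# MobiLlama as the language model; dimensions and names are placeholders.
import torch
import torch.nn as nn

class VisionLanguageConnector(nn.Module):
    """Two-layer MLP that maps vision features into the LLM embedding space."""
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 2048):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_features: torch.Tensor) -> torch.Tensor:
        # vision_features: (batch, num_patches, vision_dim)
        return self.proj(vision_features)

def set_stage(connector: nn.Module, llm: nn.Module, stage: int) -> None:
    """Stage 1: train only the connector (feature alignment).
    Stage 2: also unfreeze the LLM for visual instruction tuning."""
    for p in connector.parameters():
        p.requires_grad = True
    for p in llm.parameters():
        p.requires_grad = (stage == 2)

# Usage with dummy modules (stand-ins for the real encoder/decoder):
connector = VisionLanguageConnector()
llm = nn.Linear(2048, 2048)  # placeholder for the MobiLlama decoder
set_stage(connector, llm, stage=1)
image_tokens = connector(torch.randn(2, 576, 1024))  # ready to prepend to text embeddings
print(image_tokens.shape)  # torch.Size([2, 576, 2048])
```

My question is essentially whether MobiLlama-V skipped the stage-1 alignment step above, and if so, whether adding it (and a dataset like ALLaVA-4V) is planned.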
Thanks!