Multimodal Video/YouTube QA

YouTube URL or video file -> transcription + frames -> text embeddings + image embeddings -> vector DB -> RAG over images + text.
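As a rough illustration of the "transcription + frames" stage, the sketch below pairs sampled frame timestamps with the transcript text spoken at that moment, so that each frame can later be stored next to its matching text. It is not taken from this repo: `Segment`, `frame_timestamps`, `align_frames`, and the sampling interval are hypothetical names and values, and the transcript segments are hand-written stand-ins for what a speech-to-text model with timestamps would return.

```python
from __future__ import annotations

import math
from dataclasses import dataclass


@dataclass
class Segment:
    """One transcript segment with start/end times in seconds (hypothetical shape)."""
    start: float
    end: float
    text: str


def frame_timestamps(duration: float, every: float) -> list[float]:
    """Timestamps in seconds at which to sample frames: 0, every, 2*every, ... (< duration)."""
    if every <= 0:
        raise ValueError("every must be positive")
    return [i * every for i in range(math.ceil(duration / every))]


def align_frames(segments: list[Segment], timestamps: list[float]) -> list[tuple[float, str]]:
    """Pair each frame timestamp with the text spoken at that moment ("" during silence)."""
    pairs = []
    for t in timestamps:
        text = next((s.text for s in segments if s.start <= t < s.end), "")
        pairs.append((t, text))
    return pairs


# Hand-written stand-in for a timestamped transcript.
segments = [
    Segment(0.0, 4.0, "Welcome to the talk."),
    Segment(5.0, 9.0, "Here is the architecture diagram."),
]

# Sample one frame every 4 seconds of a 10-second clip (interval is an assumption).
pairs = align_frames(segments, frame_timestamps(duration=10.0, every=4.0))
```

Keeping the timestamp on every frame and text chunk is one way to let the answer point back to the exact moment in the video.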

- LLM: Gemini Pro Vision
- Text embedding: BAAI/bge-large-en-v1.5
- Image embedding: OpenAI CLIP, or a comparable image encoder
- STT: openai/whisper-large-v3
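Because the text encoder and the image encoder produce vectors in different spaces, their embeddings cannot be compared with each other directly. One possible layout, sketched below, keeps a separate collection per modality and queries both for the same question. This is an assumption about the design, not the repo's actual code: `Index` is a hypothetical in-memory stand-in for a vector DB collection, and the tiny vectors are placeholders for real embeddings.

```python
from __future__ import annotations

import math
from dataclasses import dataclass, field


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


@dataclass
class Index:
    """Tiny in-memory stand-in for one vector DB collection (hypothetical)."""
    dim: int
    items: list = field(default_factory=list)  # (payload, vector) pairs

    def add(self, payload: str, vector: list[float]) -> None:
        if len(vector) != self.dim:
            raise ValueError(f"expected {self.dim} dims, got {len(vector)}")
        self.items.append((payload, vector))

    def search(self, query: list[float], k: int = 1) -> list[str]:
        """Return the payloads of the k entries most similar to the query."""
        if len(query) != self.dim:
            raise ValueError(f"expected {self.dim} dims, got {len(query)}")
        ranked = sorted(self.items, key=lambda it: cosine(query, it[1]), reverse=True)
        return [payload for payload, _ in ranked[:k]]


# Toy dimensions for readability; real encoders produce much larger vectors.
text_index = Index(dim=3)
image_index = Index(dim=2)

text_index.add("transcript 00:00-00:04", [1.0, 0.0, 0.0])
text_index.add("transcript 00:05-00:09", [0.0, 1.0, 0.0])
image_index.add("frame_0000.jpg", [1.0, 0.0])
image_index.add("frame_0008.jpg", [0.0, 1.0])

# The same question is embedded twice, once per encoder (placeholder vectors here).
query_text_vec = [0.1, 0.9, 0.0]
query_image_vec = [0.2, 0.8]

# Retrieved text and frames would then be passed together to the multimodal LLM.
context = {
    "text": text_index.search(query_text_vec, k=1),
    "images": image_index.search(query_image_vec, k=1),
}
```

CLIP ships a text encoder alongside its image encoder, which is what makes it possible to search the image collection with a text question.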

Demo: gradio-app.ipynb