YouTube link or local video -> transcription + frames -> text embeddings + image embeddings -> vector DB -> RAG over image + text.
LLM: Gemini Pro Vision
Text embedding: BAAI/bge-large-en-v1.5
Image embedding: OpenAI CLIP or a similar model
STT: openai/whisper-large-v3
Demo: gradio-app.ipynb
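
The retrieval half of the pipeline above can be sketched in a few lines. This is a toy illustration, not the code in the notebook: `toy_embed`, `VectorStore`, `build_store`, `retrieve`, `build_prompt`, the sample transcript, and the frame paths are all hypothetical names and data made up for the example. A word-count vector stands in for both bge and CLIP, and a short description string stands in for each frame's pixels. In the real setup the text and image embeddings would come from two different models and live in separate vector spaces.

```python
from __future__ import annotations

import math
import re
from dataclasses import dataclass

# Hypothetical toy vocabulary; a real system would use learned embeddings.
VOCAB = ["attention", "transformer", "gradient", "descent", "diagram", "plot"]


@dataclass
class Item:
    kind: str          # "text" (transcript chunk) or "image" (video frame)
    timestamp: float   # seconds into the video
    payload: str       # transcript text, or path to the frame image
    vector: list[float]


def toy_embed(text: str) -> list[float]:
    """Stand-in embedder: counts of VOCAB words. Not bge, not CLIP."""
    words = re.findall(r"[a-z]+", text.lower())
    return [float(words.count(w)) for w in VOCAB]


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class VectorStore:
    """In-memory stand-in for the vector DB, searched by brute force."""

    def __init__(self) -> None:
        self.items: list[Item] = []

    def add(self, item: Item) -> None:
        self.items.append(item)

    def search(self, query_vec: list[float], kind: str, k: int = 1) -> list[Item]:
        # Text and image items are searched separately, one modality at a time.
        candidates = [it for it in self.items if it.kind == kind]
        candidates.sort(key=lambda it: cosine(query_vec, it.vector), reverse=True)
        return candidates[:k]


def build_store() -> VectorStore:
    store = VectorStore()
    # Made-up transcript chunks with start times in seconds.
    transcript = [
        (12.0, "The transformer uses attention."),
        (95.0, "Gradient descent updates the weights."),
    ]
    for ts, text in transcript:
        store.add(Item("text", ts, text, toy_embed(text)))
    # Made-up frames; the description replaces what an image encoder would see.
    frames = [
        (10.0, "frames/0010.jpg", "attention diagram"),
        (90.0, "frames/0090.jpg", "gradient descent plot"),
    ]
    for ts, path, description in frames:
        store.add(Item("image", ts, path, toy_embed(description)))
    return store


def retrieve(store: VectorStore, question: str, k: int = 1):
    query_vec = toy_embed(question)
    return store.search(query_vec, "text", k), store.search(query_vec, "image", k)


def build_prompt(question: str, chunks: list[Item], frames: list[Item]) -> dict:
    """Bundle retrieved text and frame paths for a multimodal LLM call."""
    context = "\n".join(f"[{c.timestamp:.0f}s] {c.payload}" for c in chunks)
    return {
        "text": f"Context:\n{context}\n\nQuestion: {question}",
        "images": [f.payload for f in frames],
    }
```

Calling `retrieve(build_store(), "How does attention work?")` returns the transcript chunk and the frame closest to the question, and `build_prompt` packs them into one payload. Keeping timestamps on every item makes it possible to show the user where in the video an answer came from.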