How to understand and use the audio embedding? #148
Comments
I have similar doubts. After extracting text and audio embeddings, I can easily compute cosine similarity to find closely related pairs, and retrieve audio from text inputs and vice versa. However, I would like to know whether there is a way to decode the embeddings into text. Decoding them into audio seems manageable using AudioLDM.
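The retrieval step described in this comment can be sketched with plain NumPy. This is only a minimal illustration, not code from this repository: the helper names `cosine_similarity_matrix` and `retrieve` are hypothetical, and the random arrays are toy stand-ins for real CLAP audio and text embeddings (the dimension 512 is an assumption made for the example).

```python
import numpy as np


def cosine_similarity_matrix(a, b):
    """Return the matrix of cosine similarities between rows of a and rows of b."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    # L2-normalize each row so that a dot product equals the cosine similarity.
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_unit @ b_unit.T


def retrieve(query_embeds, candidate_embeds, top_k=1):
    """For each query row, return indices and scores of the top_k closest candidates."""
    sims = cosine_similarity_matrix(query_embeds, candidate_embeds)
    # Sort candidates by descending similarity and keep the first top_k per query.
    order = np.argsort(-sims, axis=1)[:, :top_k]
    return order, np.take_along_axis(sims, order, axis=1)


# Toy stand-ins (assumption): 4 "audio" embeddings, and 2 "text" embeddings built
# as slightly noisy copies of audio rows 2 and 0, so the correct matches are known.
rng = np.random.default_rng(0)
audio_embeds = rng.normal(size=(4, 512))
text_embeds = audio_embeds[[2, 0]] + 0.05 * rng.normal(size=(2, 512))

idx, scores = retrieve(text_embeds, audio_embeds, top_k=1)
print(idx.ravel(), scores.ravel())
```

With real data, the text embeddings would be the queries and the audio embeddings the candidates for text-to-audio retrieval; swapping the two arguments gives audio-to-text retrieval.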
@cvillela Is there a way to decode CLAP embeddings into audio using AudioLDM?
Once I draw the analogy to CLIP, I can see how to use CLAP. My mind was stuck before ><. Thanks for your hints!
@arthur19312 The following CLAP implementation also supports a model for audio captioning (not yet tested): https://arxiv.org/abs/2309.05767
I'm new here. I ran the method get_audio_embedding_from_filelist with the model music_audioset_epoch_15_esc_90.14.pt and got the audio embeddings. I roughly understand that they represent features of the input audio, but I don't know how to use them.
Could someone tell me what the audio embedding that I get as floats actually is? Is this audio embedding compatible with other models? And how should I use it?
(PS: I'm really interested in this work, but I seem to lack some of the necessary background knowledge, so it would be great if someone could recommend some relevant materials to help me get into the field. Thank you so much ❤)