
HarmonixSet segment pre-processing? #8

Open
domderen opened this issue Nov 7, 2023 · 5 comments

Comments

@domderen

domderen commented Nov 7, 2023

Hey @tae-jun,

Thanks for your research and this amazing model! I'm trying to re-create your training process, and I ran into an issue while trying to run the allin1-train command. I'm getting this kind of error:

File "/workspace/src/allin1/training/data/datasets/harmonix/dataset.py", line 74, in __getitem__
    data = super().__getitem__(idx)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/src/allin1/training/data/datasets/datasetbase.py", line 78, in __getitem__
    true_function = st.section.of_frames(encode=True, return_labels=True)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/src/allin1/training/data/eventconverters/eventconverters.py", line 153, in of_frames
    labels = np.array([self.label_map[l] for l in labels])
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/src/allin1/training/data/eventconverters/eventconverters.py", line 153, in <listcomp>
    labels = np.array([self.label_map[l] for l in labels])
                       ~~~~~~~~~~~~~~^^^
KeyError: 'prechorus'

As far as I understand, that error is caused by the fact that HARMONIX_LABELS doesn't contain the value prechorus, but this value exists as one of the segment labels in the Harmonix dataset, for example here: https://github.com/urinieto/harmonixset/blob/master/dataset/segments/0006_aint2proud2beg.txt

There are more labels in the Harmonix dataset that are not listed in HARMONIX_LABELS, so I'm wondering if you did some kind of manual preprocessing that converted all segment labels from the Harmonix dataset into your list?

I would appreciate your guidance, as I would love to re-create your training process to see if this model could be trained not only on full songs but also on parts of songs for beat/downbeat detection.

Thanks in advance for your help!

@tae-jun
Member

tae-jun commented Nov 7, 2023

Hi! Awesome that you are training models by yourself.

I'm also curious whether or not the training works on short audio.

Please share the results if it's possible!

But the receptive field of the model is 82 seconds, so you might need to make the audio segments longer than that.

Ok, so back to your question.

I merged the labels following this previous work: https://arxiv.org/pdf/2205.14700.pdf

They put the code snippet below in the paper:

substrings = [
  ("silence", "silence"), ("pre-chorus", "verse"),
  ("prechorus", "verse"), ("refrain", "chorus"),
  ("chorus", "chorus"), ("theme", "chorus"),
  ("stutter", "chorus"), ("verse", "verse"),
  ("rap", "verse"), ("section", "verse"),
  ("slow", "verse"), ("build", "verse"),
  ("dialog", "verse"), ("intro", "intro"),
  ("fadein", "intro"), ("opening", "intro"),
  ("bridge", "bridge"), ("trans", "bridge"),
  ("out", "outro"), ("coda", "outro"),
  ("ending", "outro"), ("break", "inst"),
  ("inst", "inst"), ("interlude", "inst"),
  ("impro", "inst"), ("solo", "inst")]

def conversion(label):
  if label == "end": return "end"
  for s1, s2 in substrings:
    if s1 in label.lower(): return s2
  return "inst"

Thanks!

@domderen
Author

domderen commented Nov 7, 2023

Hey, thanks for the quick response!

I can definitely share my results once I have them. Could you elaborate on what you mean by the model's receptive field being 82 seconds? I was thinking about training this model on the HarmonixSet songs split into 5- or 10-second chunks, to see if the model could learn to detect beats/downbeats even on a smaller audio file.

To give you some context, I'm wondering if this model could be used in a similar fashion to how Shazam works, where it could detect beats/downbeats when fed a microphone-recorded snippet of a song. My end goal is to try to create something like https://www.haptik.watch/: an application that provides beat/downbeat highlighting for a song it listens to via a microphone.

I understand that this model might be too big to use on a watch or even a phone, but it has the best beat/downbeat detection accuracy I have seen so far, so I wanted to start from it and see whether it could detect those features on smaller audio snippets. If that still works well, I was planning to see what I could do to make the model smaller and potentially run it on a phone.

And thanks for the info about label merging, I'll try to follow suit :)

@tae-jun
Member

tae-jun commented Nov 7, 2023

Thanks for sharing!

So the model needs 82 seconds of frames to make a prediction for a single frame. Since the dilations grow like 2^0, 2^1, ..., 2^11, the window size at the last block is huge: the last layer's window covers about 82 seconds, so it requires inputs longer than that. You can zero-pad, but that is maybe not the smartest method. Maybe you can change depth and/or dilation_factor here to make it use a smaller window.
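
For intuition, here is a rough back-of-the-envelope sketch of where the 82 seconds comes from. The kernel size of 3 and the ~100 frames per second are just illustrative assumptions; check the actual values in the repo:

# Rough receptive-field estimate for a stack of dilated layers.
# kernel_size=3 and the 100 fps frame rate are illustrative assumptions.
def receptive_field_frames(depth=12, kernel_size=3, dilation_factor=2):
  dilations = [dilation_factor ** i for i in range(depth)]  # 2^0 ... 2^11
  return 1 + (kernel_size - 1) * sum(dilations)             # frames covered

frames = receptive_field_frames()
print(frames)        # 8191 frames
print(frames / 100)  # ~82 seconds at 100 frames per second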

And yeah, I also think this model may not be the best choice for mobile applications. As I said, it needs 82 seconds of context around the frame you want to predict (41 seconds backward, 41 seconds forward), so it would be difficult to make this real-time. Also, it would take a long time to compute on mobile. Actually, the model itself is quite fast, but it needs source separation, which takes most of the processing time.

Training the All-In-One model can be a good starting point to get used to beat tracking if you are new to this, but eventually you should build a real-time architecture. I think real-time operation is another big problem in beat tracking, so it requires a different kind of effort.

@domderen
Author

domderen commented Nov 7, 2023

Ah, thank you for this explanation; I wasn't aware of this design aspect of the model. That's indeed a complicating factor :)

I saw this research with a system design more attuned to my real-time needs, but it seems the downbeat tracking F1 score there is much lower than with your model. Playing with your demo, I can see and hear that your downbeat detection is noticeably different from something like the aforementioned Haptik. I'm not sure exactly what approach they use, but it is not a great experience.

I got here while researching downbeat detection for my own music/dance understanding needs, so the topic is quite new to me from the ML point of view. I wanted to use something similar to Haptik for my personal use case, to feel the music while dancing, but the visual/sound/haptic feedback that app gives is out of sync with the music.

I saw that your model's inference on a full song takes roughly the same amount of time as the Shazam API takes to identify a song. Since I'm only interested in the beat/downbeat tracking functionality, I started wondering whether your model might help with the accuracy problem, at least in a smaller controlled environment where I could load song profiles pre-generated by your model and use them in a "player" app on the phone/watch to get that "sonify" feeling from your app.

There's of course the part where I would need to discover which sonification profile to load and sync it with what is actually being played through the speakers. I could do that at the audio-player level on the computer and connect it over the network to my watch, but it would be great if I didn't have to, so I started wondering if I could tweak your model to help search through a pre-generated list of sonification results. And that led me to wanting to check what would happen if I trained your model on smaller song chunks :)

@tae-jun
Member

tae-jun commented Nov 11, 2023

Thanks for your detailed explanation.

That's a fun application! If it needs to operate in real-time, then the model can only access past data. This might not be a significant issue for beat detection, but it poses a challenge for downbeat tracking, which requires more semantic understanding.

If you don't need structure analysis, maybe you can reduce the number of parameters even more.

However, the Dilated Neighborhood Attention that the All-In-One model uses requires future data, so it's not applicable for real-time purposes. The attention blocks would need to be replaced with causal blocks, as in WaveNet, for example.
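
For illustration, a minimal causal dilated convolution block in the WaveNet style could look like the sketch below (this is not the All-In-One code; the channel count and depth are placeholders). Left-padding ensures each output frame only depends on past frames:

import torch
import torch.nn as nn

class CausalDilatedConv1d(nn.Module):
  """1-D convolution that only looks at past frames (WaveNet-style)."""
  def __init__(self, channels, kernel_size=3, dilation=1):
    super().__init__()
    # Left-pad so that output[t] depends only on input[<= t].
    self.pad = (kernel_size - 1) * dilation
    self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

  def forward(self, x):                       # x: (batch, channels, frames)
    x = nn.functional.pad(x, (self.pad, 0))   # pad only on the left (past) side
    return self.conv(x)

# A stack with growing dilations, analogous to the dilated blocks discussed above
blocks = nn.Sequential(*[CausalDilatedConv1d(32, dilation=2 ** i) for i in range(4)])
out = blocks(torch.randn(1, 32, 500))         # same frame count, causal context only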
