Cemgil Scores > 1 #414

Open
SebastianPoell opened this issue Feb 7, 2019 · 2 comments
@SebastianPoell
Contributor

SebastianPoell commented Feb 7, 2019

The issue

Hi there, using simple thresholding as a beat detection method can be a viable baseline when evaluating different beat trackers. By its nature, thresholding tends to gather multiple detections around an annotation. However, this leads to Cemgil scores > 1 and thus usually ranks thresholding higher than all other algorithms. The reason is that annotations and detections are swapped (in contrast to the original implementation). This is also commented in the code:

# Note: the original implementation searches for the closest matches of
# detections given the annotations. Since absolute errors > a usual
# beat interval produce high errors (and thus in turn add negligible
# values to the accuracy), it is safe to swap those two.
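To make the effect concrete, here is a minimal, self-contained sketch of the Cemgil accuracy (illustration only, not madmom's actual code; sigma = 0.04 is the usual 40 ms tolerance). For the clustered-detections case, matching each annotation to its closest detection keeps the score well below 1, whereas the swapped direction lets every clustered detection contribute and pushes it above 1:

import numpy as np

def cemgil_sketch(detections, annotations, sigma=0.04):
    # Hypothetical re-implementation for illustration only, not madmom's code.
    # Original direction: for each annotation, take the closest detection and
    # weight its absolute error with a Gaussian window (sigma = 40 ms).
    errors = np.min(np.abs(annotations[:, np.newaxis] - detections[np.newaxis, :]), axis=1)
    acc = np.sum(np.exp(-(errors ** 2) / (2. * sigma ** 2)))
    # normalise by the mean number of detections and annotations
    return acc / (0.5 * (len(detections) + len(annotations)))

# one annotation, five thresholding detections clustered around it
annotations = np.array([1.0])
detections = np.array([0.98, 0.99, 1.0, 1.01, 1.02])

print(cemgil_sketch(detections, annotations))  # ~0.33: one match per annotation
print(cemgil_sketch(annotations, detections))  # ~1.57: swapped, every detection matches

Swapping the call arguments here has the same effect as swapping the matching direction inside the implementation.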

I guess the swapping was done to prevent confusion about the parameter names? For a 'normal' use case this works fine, but I'm curious whether we should rethink that swapping...

Steps needed to reproduce the behaviour

Just evaluate this example. This results in Cemgil = 1.233.

detections.txt
annotations.txt
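Something along these lines should reproduce it (loading via np.loadtxt assumes one beat time per line in the attached files; the BeatEvaluation call order and the cemgil attribute are my assumptions about the API):

import numpy as np
from madmom.evaluation.beats import BeatEvaluation

# load the attached beat times (assumed: one time stamp per line, in seconds)
detections = np.loadtxt('detections.txt')
annotations = np.loadtxt('annotations.txt')

# assumed call signature: detections first, then annotations
e = BeatEvaluation(detections, annotations)
print(e.cemgil)  # reported as 1.233 for the attached files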

@superbock
Collaborator

Yes, that's definitely not the expected behaviour, but I am not sure the bug is limited to Cemgil's metric.

Swapping annotations and detections reduces Cemgil's score to 0.601, which is way more reasonable. However, I am not sure at all whether the other metrics behave as intended. Since your example has way more detections than annotations, information gain is also pretty close to its maximum. And even simple metrics like F-measure should be low given so many false positive detections. So it looks more like we are generally unable to handle that many detections.

I think it is safest to do proper peak picking after thresholding. Have you tried features.onsets.peak_picking? It basically does thresholding combined with local-maxima peak picking.
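Roughly, it does something like the following (a self-contained sketch for illustration, not the actual implementation; threshold = 0.5 and fps = 100 are made-up values):

import numpy as np

def threshold_and_pick(activations, threshold=0.5, fps=100):
    # Illustrative sketch only, not madmom's features.onsets.peak_picking.
    # A frame is reported only if it exceeds the threshold AND is a local
    # maximum, i.e. not smaller than its direct neighbours.
    above = activations > threshold
    local_max = np.r_[activations[0] >= activations[1],
                      (activations[1:-1] >= activations[:-2]) &
                      (activations[1:-1] >= activations[2:]),
                      activations[-1] >= activations[-2]]
    frames = np.nonzero(above & local_max)[0]
    # convert frame indices to seconds
    return frames / float(fps)

# plain thresholding would report all five frames above 0.5;
# local-maxima picking keeps only one detection per peak
act = np.array([0.1, 0.6, 0.7, 0.9, 0.7, 0.6, 0.1])
print(threshold_and_pick(act))  # -> [0.03]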

P.S. I am not sure why I was swapping them in the first place, so I have to rethink the whole issue. Also I have to compare the results with mir_eval to see how their implementation behaves.

@SebastianPoell
Contributor Author

SebastianPoell commented Feb 8, 2019

Yeah, you are right; I also noticed it with P-Score (although it never went > 1). Hmm, maybe we should rename this issue...

Yes, in most cases peak picking fixes it completely. However, in the worst case (online networks/online algorithms) I'm still getting a ~3% difference for P-Score and Cemgil depending on the swapping. After all, this is an edge case, but it's probably still worth a closer look 🤷‍♂️
