Cemgil Scores > 1 #414

Open
SebastianPoell opened this issue Feb 7, 2019 · 2 comments
@SebastianPoell
Contributor

SebastianPoell commented Feb 7, 2019

The issue

Hi there, using simple thresholding as a beat detection method can be a viable baseline when evaluating different beat trackers. By its nature, thresholding tends to gather multiple detections around an annotation. However, this leads to Cemgil scores > 1 and thus usually ranks thresholding higher than all other algorithms. The reason is that annotations and detections are swapped (in contrast to the original implementation). This is also commented in the code:

# Note: the original implementation searches for the closest matches of
# detections given the annotations. Since absolute errors > a usual
# beat interval produce high errors (and thus in turn add negligible
# values to the accuracy), it is safe to swap those two.
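To make the effect concrete, here is a minimal, self-contained sketch of the Cemgil accuracy (illustration only, not madmom's actual code; sigma = 0.04 is the usual 40 ms tolerance). For the clustered-detections case, matching each annotation to its closest detection keeps the score well below 1, whereas the swapped direction lets every clustered detection contribute and pushes it above 1:

import numpy as np

def cemgil_sketch(detections, annotations, sigma=0.04):
    # Hypothetical re-implementation for illustration only, not madmom's code.
    # Original direction: for each annotation, take the closest detection and
    # weight its absolute error with a Gaussian window (sigma = 40 ms).
    errors = np.min(np.abs(annotations[:, np.newaxis] - detections[np.newaxis, :]), axis=1)
    acc = np.sum(np.exp(-(errors ** 2) / (2. * sigma ** 2)))
    # normalise by the mean number of detections and annotations
    return acc / (0.5 * (len(detections) + len(annotations)))

# one annotation, five thresholding detections clustered around it
annotations = np.array([1.0])
detections = np.array([0.98, 0.99, 1.0, 1.01, 1.02])

print(cemgil_sketch(detections, annotations))  # ~0.33: one match per annotation
print(cemgil_sketch(annotations, detections))  # ~1.57: swapped, every detection matches

Swapping the call arguments here has the same effect as swapping the matching direction inside the implementation.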

I guess the swapping was done to prevent confusion about the parameter names? For a 'normal' use case this works fine, but I'm curious whether we should rethink that swapping...

Steps needed to reproduce the behaviour

Just evaluate this example. This results in Cemgil = 1.233.

detections.txt
annotations.txt
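Something along these lines should reproduce it (loading via np.loadtxt assumes one beat time per line in the attached files; the BeatEvaluation call order and the cemgil attribute are my assumptions about the API):

import numpy as np
from madmom.evaluation.beats import BeatEvaluation

# load the attached beat times (assumed: one time stamp per line, in seconds)
detections = np.loadtxt('detections.txt')
annotations = np.loadtxt('annotations.txt')

# assumed call signature: detections first, then annotations
e = BeatEvaluation(detections, annotations)
print(e.cemgil)  # reported as 1.233 for the attached files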

@superbock
Collaborator

Yes, that's definitely not the expected behaviour, but I am not sure the bug is limited to Cemgil's metric.

Swapping annotations and detections reduces Cemgil's score to 0.601, which is way more reasonable. However, I am not sure at all whether the other metrics behave as intended. Since your example has way more detections than annotations, information gain is also pretty close to its maximum. And even simple metrics like F-measure should be low given so many false positive detections. So it looks more like we are generally unable to handle that many detections.

I think it is safest to do proper peak picking after thresholding. Have you tried features.onsets.peak_picking? It basically does thresholding combined with local-maxima peak picking.
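Roughly, it does something like the following (a self-contained sketch for illustration, not the actual implementation; threshold = 0.5 and fps = 100 are made-up values):

import numpy as np

def threshold_and_pick(activations, threshold=0.5, fps=100):
    # Illustrative sketch only, not madmom's features.onsets.peak_picking.
    # A frame is reported only if it exceeds the threshold AND is a local
    # maximum, i.e. not smaller than its direct neighbours.
    above = activations > threshold
    local_max = np.r_[activations[0] >= activations[1],
                      (activations[1:-1] >= activations[:-2]) &
                      (activations[1:-1] >= activations[2:]),
                      activations[-1] >= activations[-2]]
    frames = np.nonzero(above & local_max)[0]
    # convert frame indices to seconds
    return frames / float(fps)

# plain thresholding would report all five frames above 0.5;
# local-maxima picking keeps only one detection per peak
act = np.array([0.1, 0.6, 0.7, 0.9, 0.7, 0.6, 0.1])
print(threshold_and_pick(act))  # -> [0.03]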

P.S. I am not sure why I was swapping them in the first place, so I have to rethink the whole issue. Also I have to compare the results with mir_eval to see how their implementation behaves.

@SebastianPoell
Contributor Author

SebastianPoell commented Feb 8, 2019

Yeah, you are right; I also noticed it with P-Score (although it never went > 1). Hmm, maybe we should rename this issue...

Yes, in most cases peak picking fixes it completely. However, in the worst case (online networks/online algorithms) I'm still getting a ~3% difference for P-Score and Cemgil depending on the swapping. After all, this is an edge case, but it's probably still worth a closer look 🤷‍♂️
