
Incorrect aggregative counts for bio.tools #45

Open
bryan-brancotte opened this issue Mar 12, 2021 · 21 comments

@bryan-brancotte
Member

bryan-brancotte commented Mar 12, 2021

Thanks to @matuskalas in #20 (comment) who spotted that the count at https://edamontology.github.io/edam-browser/#operation_3198 is wrong for bio.tools.

I think it might be related to the fact that the term is duplicated, and thus counted twice, but looking at https://edamontology.github.io/edam-browser/#topic_0084 the descendant count makes no sense at all for now.

@matuskalas
Member

Indeed, thanks for creating a dedicated bug issue for it @bryan-brancotte ! 👍🏽

Are the numbers returned by the bio.tools API calls correct in these cases?

@bryan-brancotte
Member Author

I think the API responses are OK (a quick test indicates it); it is the aggregation of the descendants, done here, that is wrong.

If the bio.tools API returned the count including the descendants, it would be both safer (being unit tested) and faster! ;)

@HagerDakroury
Collaborator

Hi, can I get some guidance for testing this?

As I understand it, the count should be the current term's count + the descendants' counts.

The error offset for #operation_3198 is almost double the current value, but for #topic_0084 it's ~100. So duplication being the issue is ruled out for now.

Is this correct?

@matuskalas
Member

Thank you a billion times @HagerDakroury for looking into this issue! 👍🏽🙏🏽

I found an example with smaller numbers: https://edamontology.github.io/edam-browser/#data_2977. Used as output: 20 times as is, but 80 times with its 3 "descendants" (wrong); the 3 sub-concepts have 15, 13, and 6, so the sum should have been ≤ 54.

@matuskalas
Member

Another: https://edamontology.github.io/edam-browser/#format_1974
1 time as output, 66 times with 1 "descendant", in fact the only sub-concept is not used 🤔

@HagerDakroury
Collaborator

Thank you a billion times @HagerDakroury for looking into this issue! 👍🏽🙏🏽

I found an example with smaller numbers: https://edamontology.github.io/edam-browser/#data_2977. Used as output: 20 times as is, but 80 times with its 3 "descendants" (wrong); the 3 sub-concepts have 15, 13, and 6, so the sum should have been ≤ 54.

That definitely seems pretty random. Even if sub-concepts were counted twice (duplicated), the maths doesn't add up!

Another: https://edamontology.github.io/edam-browser/#format_1974
1 time as output, 66 times with 1 "descendant", in fact the only sub-concept is not used 🤔

Hmm, I'm not getting 66 times? Here's what comes up:
[screenshot: counts shown in the browser]

Actually, that example should show not used as input (1) and used 1 time as output (1). So maybe the sub-concept's usage as input got mistakenly counted as an output?

@matuskalas
Member

matuskalas commented Apr 7, 2021

Indeed, now I'm getting the same as you, with 2. I must have mixed it up yesterday.

In any case, I found out where the numbers come from! 🙌🏽
They are just the actual number for the given concept, multiplied by the number of its sub-concepts + 1 (the concept itself). Nice to see it e.g. here:
https://edamontology.github.io/edam-browser/#topic_3678
https://edamontology.github.io/edam-browser/#format_1975
even if there are more layers https://edamontology.github.io/edam-browser/#topic_0605
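The hypothesized pattern can be checked in one line, using the data_2977 numbers quoted earlier in this thread (the helper name is just for illustration):

```javascript
// Hypothesized formula behind the wrong counts:
// shown = actual * (number of sub-concepts + 1)
const shownCount = (actual, numSubConcepts) => actual * (numSubConcepts + 1);

// data_2977: 20 actual usages, 3 sub-concepts
console.log(shownCount(20, 3)); // 80, the wrong value observed
```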

If you would like to dive into fixing this @HagerDakroury you'll be the hero! 🦸‍♀️
I'd personally suggest 3 options (but let's hear what @bryan-brancotte & @hmenager say)

  • Simple: just summing up the numbers, and showing something like "up to X times with its sub-concepts" (not caring that some tools might be counted multiple times)
  • A correct solution (though I wonder whether it will perform well when 1000s of entries are found): retrieving the full lists for each sub-concept, creating a union set (i.e. no duplicates), and counting them.
  • Maybe a compromise: like above, but cut off at e.g. 1000 and display just ">1000". Or a smaller number if needed.
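The 2nd option could be sketched roughly like this (fetchToolIds is a hypothetical helper standing in for the bio.tools API calls, not the actual edam-browser code):

```javascript
// Count distinct tools annotated with a concept or any of its sub-concepts.
// A tool annotated with several of them is counted only once.
function countDistinctUsages(conceptId, descendantIds, fetchToolIds) {
  const unique = new Set();
  for (const id of [conceptId, ...descendantIds]) {
    for (const toolId of fetchToolIds(id)) {
      unique.add(toolId); // Set membership removes cross-concept duplicates
    }
  }
  return unique.size;
}
```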

However, if the 2nd or 3rd solution is too complex to implement, then we perhaps don't want anyone spending too much time on it, as it really should be bio.tools supplying those numbers from its API (hopefully one day, finally 😊). What do you all think?

@HagerDakroury
Collaborator

In any case, I found out where the numbers come from! 🙌🏽
They are just the actual number for the given concept, multiplied by the number of its sub-concepts + 1 (the concept itself).

Impressive! 😃 I spent way too much time trying to crack the maths here. Can't believe it's that simple!

If you would like to dive into fixing this @HagerDakroury you'll be the hero! 🦸‍♀️

Absolutely! I'd love to dig deep here. It'd be an interesting challenge.

However, if the 2nd or 3rd solutions are too complex to implement, then we perhaps don't want anyone spending too much time with it, as it really should be the bio.tools supplying those numbers from its API (hopefully one day, finally😊). What do you all think?

It'd really depend on how the tree is modeled and what data is pulled from bio.tools right now (I haven't examined this part of the code in detail yet).

I guess there are 2 points to figure out:

  1. Are the children's counts even accessed right now? (Or is the number of children the only info gathered?)
  2. Does it make more sense to make several recursive requests to bio.tools whenever a node is selected, or to pull all the data from the beginning and just do the aggregation locally for each selected node?

Yes, it may take time to fix, but it sure would be thrilling to try!

I'd personally suggest 3 options (but let's hear what @bryan-brancotte & @hmenager say)

I'll try exploring the code more in the meantime while waiting for their input.

@matuskalas
Member

Super cool, I like your attitude @HagerDakroury 🙌🏽🙌🏽🙌🏽
Awesome that you want to take it up as a challenge!! 😉

I'll try exploring the code more in the meantime while waiting for their input.

Perfect! Million thanks & all best 🤞🏽

bryan-brancotte added a commit that referenced this issue Apr 8, 2021
…k on using element uri, not queue[j].data.data.uri
@bryan-brancotte
Member Author

Hi @HagerDakroury @matuskalas

Thanks to both of you for diving into this issue, and thanks even more for finding examples that helped me understand what was going on. I added comments with 5275c73. The issue was that get_api_url(queue[j].text) was called with the value of an unset attribute in https://github.com/edamontology/edam-browser/blame/5275c73472659b67ebccfe05277a63044328adaa/js/bio.tools.api.js#L60; as it was not set, get_api_url was using the URI of the current (selected) element in https://github.com/edamontology/edam-browser/blame/5275c73472659b67ebccfe05277a63044328adaa/js/bio.tools.api.js#L135, which in the end meant that descendant usage was term usage * (number of its sub-concepts + 1).
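In other words (a simplified sketch of the behaviour, not the actual bio.tools.api.js code):

```javascript
// Buggy aggregation: each queued descendant's URI was unset, so the query
// fell back to the selected concept's URI and re-fetched the same count.
function aggregateBuggy(selectedCount, numDescendants) {
  let total = selectedCount;
  for (let i = 0; i < numDescendants; i++) {
    total += selectedCount; // same count fetched again for every "descendant"
  }
  return total; // = selectedCount * (numDescendants + 1)
}

// Fixed aggregation: each descendant's own count is fetched and summed.
function aggregateFixed(selectedCount, descendantCounts) {
  return descendantCounts.reduce((sum, c) => sum + c, selectedCount);
}
```

With the data_2977 numbers from earlier in the thread, aggregateBuggy(20, 3) reproduces the bogus 80, while aggregateFixed(20, [15, 13, 6]) gives the expected 54.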

The patch is pushed so the actual ids of the descendants are used, and as you can see it now works:
[Screenshot of 2021-04-08 showing the corrected counts]

Great team work!

@HagerDakroury
Collaborator

Awesome! 👏 It's such a relief that it only took one line to fix that 😃

Also, thanks for adding the comments! That made navigating the code so much easier.

@matuskalas
Member

Nice, thanks so much @bryan-brancotte for fixing this!! 🙌🏽

Is it then the 1st option from

  • Simple: just summing up the numbers, and showing something like "up to X times with its sub-concepts" (not caring that some tools might be counted multiple times)
  • A correct solution (though I wonder whether it will perform well when 1000s of entries are found): retrieving the full lists for each sub-concept, creating a union set (i.e. no duplicates), and counting them.
  • Maybe a compromise: like above, but cut off at e.g. 1000 and display just ">1000". Or a smaller number if needed.

?

Then it would be nice to slightly update the tooltip to include the "up to", wouldn't it?

And @HagerDakroury @bryan-brancotte @hmenager, do you think that a more sophisticated solution would be too slow or too complicated? I'd personally say that if it would be potentially slow, or complex code that is hard to maintain, then let's be happy with the 1st "up to" solution 😊

@bryan-brancotte
Member Author

Re-opening as some questions are unanswered.

@bryan-brancotte
Member Author

Thanks @matuskalas, and also for reminding me that for now it still counts duplicated children :/

I would definitely go for option 2: show the correct sum. Note that we do not count the descendants when looking at nodes of depth 0 or 1, i.e. EDAM and the four root nodes.

I pushed fa66e07 where duplicated nodes are not counted twice.

@HagerDakroury
Collaborator

Removing duplicates is doable; I think fa66e07 took care of it and now it's the correct sum.

And @HagerDakroury @bryan-brancotte @hmenager, do you think that a more sophisticated solution would be too slow or too complicated? I'd personally say that if it would be potentially slow, or complex code that is hard to maintain, then let's be happy with the 1st "up to" solution 😊

I'd say it's slow but not unreasonably slow 😄

Maybe the way to make this faster is by somehow saving the returned values for future reuse, or doing the requests beforehand. But the performance is currently not that bad IMO, so restructuring the calls is not a priority, since they work perfectly fine now.
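The "saving the values returned for future reuse" idea could be as small as a map keyed by concept URI (illustrative only; fetchCount stands in for the real API call):

```javascript
// Simple in-memory cache: only the first lookup per URI hits the API.
const countCache = new Map();
function getCountCached(uri, fetchCount) {
  if (!countCache.has(uri)) {
    countCache.set(uri, fetchCount(uri)); // cache miss: fetch and remember
  }
  return countCache.get(uri); // cache hit: no network request
}
```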

@matuskalas
Member

Very fast @bryan-brancotte, respect! 🥇

It definitely looks like running fast & with some "caching" ☺

Do I understand it right that it now takes into account when the same sub-concept appears multiple times in the sub-"tree"?

There's another source of duplication I meant, between entries, and that one is perhaps the slow/complex one to solve.
Looking at the 2 sub-concepts of https://edamontology.github.io/edam-browser/#topic_0082, some tools in bio.tools are annotated with both:
[Screenshots of 2021-04-08: a bio.tools entry annotated with both sub-concepts]

@bryan-brancotte
Member Author

Prior to #45 (comment)

Indeed fa66e07 took care of it, but the way the dictionary is summed might not be optimal. I also think performance is not that bad, but there is still room for improvement:

  • make bio.tools do the work 😉 @matuskalas
  • improve the dictionary sum
  • do not call the API twice when a node is duplicated, as fa66e07 only prevents it from being counted twice

After #45 (comment)

Sadly HOPMA will be counted twice; we only get the "count" attribute for a given concept. We might be able to actually get the tools and then count them, but it would be better (CPU, network) to do it on the bio.tools side.

@bryan-brancotte
Member Author

Just to write it down somewhere: we store whether a node is duplicated in node.duplicate; it is set here.
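For illustration, an aggregation pass could use that flag like this (the traversal helper is hypothetical; only the node.duplicate flag is from the actual code):

```javascript
// Sum usage counts over a subtree, counting a duplicated node only once:
// its extra occurrences in the tree carry node.duplicate = true.
function sumUsage(node) {
  let total = node.duplicate ? 0 : node.usageCount;
  for (const child of node.children || []) {
    total += sumUsage(child); // recurse into the subtree
  }
  return total;
}
```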

@HagerDakroury
Collaborator

There's another source of duplication I meant, between entries, and that one is perhaps the slow/complex one to solve.
Looking at the 2 sub-concepts of https://edamontology.github.io/edam-browser/#topic_0082, some tools in bio.tools are annotated with both:

Ohh, well spotted! And that definitely makes things harder :(

  • do not call the API twice when a node is duplicated, as fa66e07 only prevents it from being counted twice

Is this a common occurrence (a node being duplicated)? If not, and since @matuskalas meant another kind of duplication, it may not be a priority to optimize that.

Sadly HOPMA will be counted twice; we only get the "count" attribute for a given concept. We might be able to actually get the tools and then count them, but it would be better (CPU, network) to do it on the bio.tools side.

Definitely; it may be easy code-wise, but a huge bump in time complexity.

But if accuracy is currently vital for your users, then an "up to" indication may be necessary after all.

@bryan-brancotte
Member Author

Hi, I just re-read the labels. We never say that there are 58 tools associated with Splicing analysis, for example; we say that it is used 58 times, and 192 times with its 2 descendants. The counts are thus valid: they count the usage, and we indicate it.

We could clarify it and indicate that there are 192 annotations with the concept and its 2 descendants.

@bryan-brancotte
Member Author

@matuskalas, as you were the one spotting that the count is ~~incorrect~~ ambiguous, do we change the text, or keep it as is and close the issue?

@bryan-brancotte bryan-brancotte added this to the outreachy202105-1 milestone May 26, 2021
@HagerDakroury HagerDakroury removed this from the outreachy202105-1 milestone Jun 15, 2021