
Incorrect aggregative counts for bio.tools #45

Open
bryan-brancotte opened this issue Mar 12, 2021 · 21 comments

@bryan-brancotte
Member

bryan-brancotte commented Mar 12, 2021

Thanks to @matuskalas in #20 (comment) who spotted that the count at https://edamontology.github.io/edam-browser/#operation_3198 is wrong for bio.tools.

I think it might be related to the fact that the term is duplicated, and thus counted twice, but looking at https://edamontology.github.io/edam-browser/#topic_0084 the descendant count makes no sense at all for now.

@matuskalas
Member

Indeed, thanks for creating a dedicated bug issue for it @bryan-brancotte ! 👍🏽

Are the numbers returned by the bio.tools API calls correct in these cases?

@bryan-brancotte
Member Author

I think the API responses are OK (a quick test indicates it); it is the aggregation of the descendants, done here, that is wrong.

If the bio.tools API returned the count including the descendants, it would be both safer (being unit tested) and faster! ;)

@HagerDakroury
Collaborator

Hi, can I get some guidance for testing this?

As I understand it, the count should be the current term's count + the descendants' counts.

The error offset for #operation_3198 is almost double the current value, but for #topic_0084 it's ~100. So duplication being the issue is ruled out for now.

Is this correct?

@matuskalas
Member

Thank you a billion times @HagerDakroury for looking into this issue! 👍🏽🙏🏽

I found an example with smaller numbers: https://edamontology.github.io/edam-browser/#data_2977. Used as output: 20 times as is, but 80 times with its 3 "descendants" (wrong); the 3 sub-concepts have 15, 13, and 6, so the sum should have been ≤ 54.

@matuskalas
Member

Another: https://edamontology.github.io/edam-browser/#format_1974
1 time as output, 66 times with 1 "descendant", in fact the only sub-concept is not used 🤔

@HagerDakroury
Collaborator

Thank you a billion times @HagerDakroury for looking into this issue! 👍🏽🙏🏽

I found an example with smaller numbers: https://edamontology.github.io/edam-browser/#data_2977. Used as output: 20 times as is, but 80 times with its 3 "descendants" (wrong); the 3 sub-concepts have 15, 13, and 6, so the sum should have been ≤ 54.

That definitely seems pretty random. Even if sub-concepts were counted twice (duplicated), the maths doesn't add up!

Another: https://edamontology.github.io/edam-browser/#format_1974
1 time as output, 66 times with 1 "descendant", in fact the only sub-concept is not used 🤔

Hmm, I'm not getting 66 times? Here's what comes up:
[screenshot: counts shown in the browser]

Actually, that example should show not used as input (1) and used 1 time as output (1). So maybe the sub-concept's usage as input got mistakenly counted as an output?

@matuskalas
Member

matuskalas commented Apr 7, 2021

Indeed, now I'm getting the same as you, with 2. I must have mixed it up yesterday.

In any case, I found out where the numbers come from! 🙌🏽
They are just the actual number for the given concept, multiplied by the number of its sub-concepts + 1 (the concept itself). Nice to see it e.g. here:
https://edamontology.github.io/edam-browser/#topic_3678
https://edamontology.github.io/edam-browser/#format_1975
even if there are more layers https://edamontology.github.io/edam-browser/#topic_0605
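The hypothesized pattern can be checked in one line, using the data_2977 numbers quoted earlier in this thread (the helper name is just for illustration):

```javascript
// Hypothesized formula behind the wrong counts:
// shown = actual * (number of sub-concepts + 1)
const shownCount = (actual, numSubConcepts) => actual * (numSubConcepts + 1);

// data_2977: 20 actual usages, 3 sub-concepts
console.log(shownCount(20, 3)); // 80, the wrong value observed
```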

If you would like to dive into fixing this @HagerDakroury you'll be the hero! 🦸‍♀️
I'd personally suggest 3 options (but let's hear what @bryan-brancotte & @hmenager say)

  • Simple: just summing up the numbers, and showing something like "up to X times with its sub-concepts" (not caring that some tools might be counted multiple times)
  • A correct solution (though I wonder whether it will perform well when 1000s of entries are found): retrieving the full lists for each sub-concept, creating a union set (i.e. no duplicates), and counting them.
  • Maybe a compromise: like above, but cut off at e.g. 1000 and display just ">1000". Or a smaller number if needed.
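The 2nd option could be sketched roughly like this (fetchToolIds is a hypothetical helper standing in for the bio.tools API calls, not the actual edam-browser code):

```javascript
// Count distinct tools annotated with a concept or any of its sub-concepts.
// A tool annotated with several of them is counted only once.
function countDistinctUsages(conceptId, descendantIds, fetchToolIds) {
  const unique = new Set();
  for (const id of [conceptId, ...descendantIds]) {
    for (const toolId of fetchToolIds(id)) {
      unique.add(toolId); // Set membership removes cross-concept duplicates
    }
  }
  return unique.size;
}
```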

However, if the 2nd or 3rd solution is too complex to implement, then we perhaps don't want anyone spending too much time on it, as it really should be bio.tools supplying those numbers from its API (hopefully one day, finally 😊). What do you all think?

@HagerDakroury
Collaborator

In any case, I found out where the numbers come from! 🙌🏽
They are just the actual number for the given concept, multiplied by the number of its sub-concepts + 1 (the concept itself).

Impressive! 😃 I spent way too much time trying to crack the maths here. Can't believe it's that simple!

If you would like to dive into fixing this @HagerDakroury you'll be the hero! 🦸‍♀️

Absolutely! I'd love to dig deep here. It'd be an interesting challenge.

However, if the 2nd or 3rd solutions are too complex to implement, then we perhaps don't want anyone spending too much time with it, as it really should be the bio.tools supplying those numbers from its API (hopefully one day, finally😊). What do you all think?

It'd really depend on how the tree is modeled and what data is pulled from bio.tools right now (I haven't examined this part of the code in detail yet).

I guess there are 2 points to figure out:

  1. Are the children's counts even accessed right now? (Or is the number of children the only info gathered?)
  2. Does it make more sense to make several recursive requests to bio.tools whenever a node is selected, or to pull all the data from the beginning and just do the aggregation locally for each selected node?

Yes, it may take time to fix, but it sure would be thrilling to try!

I'd personally suggest 3 options (but let's hear what @bryan-brancotte & @hmenager say)

I'll try exploring the code more in the meantime while waiting for their input.

@matuskalas
Member

Super cool, I like your attitude @HagerDakroury 🙌🏽🙌🏽🙌🏽
Awesome that you want to take it up as a challenge!! 😉

I'll try exploring the code more in the meantime while waiting for their input.

Perfect! Million thanks & all best 🤞🏽

bryan-brancotte added a commit that referenced this issue Apr 8, 2021
…k on using element uri, not queue[j].data.data.uri
@bryan-brancotte
Member Author

Hi @HagerDakroury @matuskalas

Thanks to both of you for diving into this issue, and thanks even more for finding examples that helped me understand what was going on. I added comments with 5275c73. The issue was that get_api_url(queue[j].text) was called with the value of an unset attribute in https://github.com/edamontology/edam-browser/blame/5275c73472659b67ebccfe05277a63044328adaa/js/bio.tools.api.js#L60; as it was not set, get_api_url was using the URI of the current (selected) element in https://github.com/edamontology/edam-browser/blame/5275c73472659b67ebccfe05277a63044328adaa/js/bio.tools.api.js#L135, which in the end meant that descendant usage was term usage * (number of its sub-concepts + 1).
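In other words (a simplified sketch of the behaviour, not the actual bio.tools.api.js code):

```javascript
// Buggy aggregation: each queued descendant's URI was unset, so the query
// fell back to the selected concept's URI and re-fetched the same count.
function aggregateBuggy(selectedCount, numDescendants) {
  let total = selectedCount;
  for (let i = 0; i < numDescendants; i++) {
    total += selectedCount; // same count fetched again for every "descendant"
  }
  return total; // = selectedCount * (numDescendants + 1)
}

// Fixed aggregation: each descendant's own count is fetched and summed.
function aggregateFixed(selectedCount, descendantCounts) {
  return descendantCounts.reduce((sum, c) => sum + c, selectedCount);
}
```

With the data_2977 numbers from earlier in the thread, aggregateBuggy(20, 3) reproduces the bogus 80, while aggregateFixed(20, [15, 13, 6]) gives the expected 54.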

The patch is pushed so the actual ids of the descendants are used, and as you can see it now works:
[Screenshot of 2021-04-08 showing the corrected counts]

Great team work!

@HagerDakroury
Collaborator

Awesome! 👏 It's such a relief that it only took one line to fix that 😃

Also, thanks for adding the comments! That made navigating the code so much easier.

@matuskalas
Member

Nice, thanks so much @bryan-brancotte for fixing this!! 🙌🏽

Is it then the 1st option from

  • Simple: just summing up the numbers, and showing something like "up to X times with its sub-concepts" (not caring that some tools might be counted multiple times)
  • A correct solution (though I wonder whether it will perform well when 1000s of entries are found): retrieving the full lists for each sub-concept, creating a union set (i.e. no duplicates), and counting them.
  • Maybe a compromise: like above, but cut off at e.g. 1000 and display just ">1000". Or a smaller number if needed.

?

Then it would be nice to slightly update the tooltip to include the "up to", wouldn't it?

And @HagerDakroury @bryan-brancotte @hmenager, do you think that a more sophisticated solution would be too slow or too complicated? I'd personally say that if it would be potentially slow, or complex code that is hard to maintain, then let's be happy with the 1st "up to" solution 😊

@bryan-brancotte
Member Author

Re-opening as some questions are unanswered.

@bryan-brancotte
Member Author

Thanks @matuskalas, and also for reminding me that for now it still counts duplicated children :/

I would definitely go for option 2: show the correct sum. Note that we do not count the descendants when looking at nodes of depth 0 or 1, i.e. EDAM and the four root nodes.

I pushed fa66e07 where duplicated nodes are not counted twice.

@HagerDakroury
Collaborator

Removing duplicates is doable; I think fa66e07 took care of it and now it's the correct sum.

And @HagerDakroury @bryan-brancotte @hmenager, do you think that a more sophisticated solution would be too slow or too complicated? I'd personally say that if it would be potentially slow, or complex code that is hard to maintain, then let's be happy with the 1st "up to" solution 😊

I'd say it's slow but not unreasonably slow 😄

Maybe the way to make this faster is by somehow saving the returned values for future reuse, or doing the requests beforehand. But the performance is currently not that bad IMO, so restructuring the calls is not a priority, since they work perfectly fine now.
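The "saving the values returned for future reuse" idea could be as small as a map keyed by concept URI (illustrative only; fetchCount stands in for the real API call):

```javascript
// Simple in-memory cache: only the first lookup per URI hits the API.
const countCache = new Map();
function getCountCached(uri, fetchCount) {
  if (!countCache.has(uri)) {
    countCache.set(uri, fetchCount(uri)); // cache miss: fetch and remember
  }
  return countCache.get(uri); // cache hit: no network request
}
```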

@matuskalas
Member

Very fast @bryan-brancotte, respect! 🥇

It definitely looks like running fast & with some "caching" ☺

Do I understand it right that it now takes into account when the same sub-concept appears multiple times in the sub-"tree"?

There's another source of duplication I meant, between entries, and that one is perhaps the slow/complex one to solve.
Looking at the 2 sub-concepts of https://edamontology.github.io/edam-browser/#topic_0082, some tools in bio.tools are annotated with both:
[Screenshots of 2021-04-08: a bio.tools entry annotated with both sub-concepts]

@bryan-brancotte
Member Author

Prior to #45 (comment)

Indeed fa66e07 took care of it, but the way the dictionary is summed might not be optimal. I also think performance is not that bad, but there is still room for improvement:

  • make bio.tools do the work 😉 @matuskalas
  • improve the dictionary sum
  • do not call the API twice when a node is duplicated, as fa66e07 only prevents it from being counted twice

After #45 (comment)

Sadly HOPMA will be counted twice; we only get the "count" attribute for a given concept. We might be able to actually get the tools and then count them, but it would be better (CPU, network) to do it on the bio.tools side.

@bryan-brancotte
Member Author

Just to write it down somewhere: we store whether a node is duplicated in node.duplicate; it is set here.
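For illustration, an aggregation pass could use that flag like this (the traversal helper is hypothetical; only the node.duplicate flag is from the actual code):

```javascript
// Sum usage counts over a subtree, counting a duplicated node only once:
// its extra occurrences in the tree carry node.duplicate = true.
function sumUsage(node) {
  let total = node.duplicate ? 0 : node.usageCount;
  for (const child of node.children || []) {
    total += sumUsage(child); // recurse into the subtree
  }
  return total;
}
```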

@HagerDakroury
Collaborator

There's another source of duplication I meant, between entries, and that one is perhaps the slow/complex one to solve.
Looking at the 2 sub-concepts of https://edamontology.github.io/edam-browser/#topic_0082, some tools in bio.tools are annotated with both:

Ohh, well spotted! And that definitely makes things harder :(

  • do not call the API twice when a node is duplicated, as fa66e07 only prevents it from being counted twice

Is this a common occurrence (a node being duplicated)? If not, and since @matuskalas meant another kind of duplication, it may not be a priority to optimize that.

Sadly HOPMA will be counted twice; we only get the "count" attribute for a given concept. We might be able to actually get the tools and then count them, but it would be better (CPU, network) to do it on the bio.tools side.

Definitely; it may be easy code-wise, but a huge bump in time complexity.

But if accuracy is currently vital for your users, then an "up to" indication may be necessary after all.

@bryan-brancotte
Member Author

Hi, I just re-read the labels. We never say that there are 58 tools associated with Splicing analysis, for example; we say that it is used 58 times, and 192 times with its 2 descendants. The counts are thus valid: they count the usage, and we indicate it.

We could clarify it and indicate that there are 192 annotations with the concept and its 2 descendants.

@bryan-brancotte
Member Author

@matuskalas, as you were the one spotting that the count is ~~incorrect~~ ambiguous, do we change the text, or keep it as is and close the issue?

@bryan-brancotte bryan-brancotte added this to the outreachy202105-1 milestone May 26, 2021
@HagerDakroury HagerDakroury removed this from the outreachy202105-1 milestone Jun 15, 2021