-
Notifications
You must be signed in to change notification settings - Fork 23
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Incorrect aggregative counts for bio.tools #45
Comments
Indeed, thanks for creating a dedicated bug issue for it @bryan-brancotte ! 👍🏽 Are the numbers returned by the bio.tools API calls correct in these cases? |
I think API responses are ok (quick test indicat it) it is the aggregation of the descendant made here. If bio.tools API would return the count including the descendants, it would be both safer as unit tested, and faster ! ;) |
Hi, can I get some guidance for testing this? As I understood, the count should be the current term count + the descendants' counts. The error offset for #operation_3198 is almost double the current value. But for #topic_0084 it's ~100. So the duplication being the issue is ruled out for now. Is this correct? |
Thank you a billion times @HagerDakroury for looking into this issue! 👍🏽🙏🏽 I found an example with smaller numbers: https://edamontology.github.io/edam-browser/#data_2977. For as output: 20 times as is, 80 times with 3 "descendants" (wrong), the 3 sub-concepts have 15, 13, and 6. So the sum should have been <=54 |
Another: https://edamontology.github.io/edam-browser/#format_1974 |
That definitely seems pretty random. Even if sub-concepts are counted twice (duplicate). The maths doesn't add up!
hmm, I'm not getting 66 times? Here's what comes up Actually, that example should have not used as input (1) - 1 times as output (1) . So maybe the sub-concept usage as input got mistakingly counted as an output? |
Indeed, now I'm getting the same as, with 2. I must have mixed it up yesterday. In any case, I found out where the numbers come from! 🙌🏽 If you would like to dive into fixing this @HagerDakroury you'll be the hero! 🦸♀️
However, if the 2nd or 3rd solutions are too complex to implement, then we perhaps don't want anyone spending too much time with it, as it really should be the bio.tools supplying those numbers from its API (hopefully one day, finally😊). What do you all think? |
Impressive! 😃 I spent way too much time trying to crack the maths here. Can't believe it's that simple!
Absolutely! I'd love to dig deep here. It'd be an interesting challenge.
It'd really depend on how the tree is modeled and what's the data pulled from bio.tools right now (I haven't examined this part's code in detail yet) I guess the 2 points to be figured out:
Yes, it may take time to fix it, but sure would be thrilling to try!
I'll try exploring the code more in the meantime while waiting for their input. |
Super cool, I like your attitude @HagerDakroury 🙌🏽🙌🏽🙌🏽
Perfect! Million thanks & all best 🤞🏽 |
…k on using element uri, not queue[j].data.data.uri
Thanks to both of you for diving in this issue, and thanks even more for finding example that helped me to understand what was going on. I added comments with 5275c73. The issue was that Patch is pushed so actual id of descendant are used, and as you can see it now works great team works ! |
Awesome! 👏 It's such a relief that it only took one line to fix that 😃 Also, thanks for adding the comments! That made navigating the code so much easier. |
Nice, thanks so much @bryan-brancotte for fixing this!! 🙌🏽 Is it then the 1st option from
? Then it would be nice to slightly update the tooltip to include the "upto", wouldn't it? And @HagerDakroury @bryan-brancotte @hmenager, do you think that a more sophisticated solution would be too slow or too complicated? I'd personally say that if it should be potentially slow or potentially a complex code hard to maintain, then let's be happy with the 1st "upto" solution 😊 |
re-opening as some question are unanswered |
Thanks @matuskalas, and also for remininding me that for now it still count duplicated child :/ I would definitely go for option 2: show to correct sum. Note that we do not count the descendant when looking at node of depth 0 or 1, i.e: EDAM and the four root node. I pushed fa66e07 where duplicated node are not counted twice |
Removing duplicates is doable, I think fa66e07 took care of it and now it's the correct sum.
I'd say it's slow but not unnecessarily slow 😄 Maybe the way to make this faster is by somehow saving the values returned for future reuse or doing the requests beforehand. But the performance is currently not that bad IMO. So, restructuring the calls is not a priority? Since they work perfectly fine now. |
Very fast @bryan-brancotte, respect! 🥇 It definitely looks like running fast & with some "caching" ☺ Do I understand it right that now it takes into account when the same sub-concept appears multiple times in the sub-"tree"? There's another source of duplication I meant, between entries, and that one is perhaps the slow|complex one to solve? |
Prior to #45 (comment) Indeed fa66e07 toke care of it, but might not be optimal on the way to sum the dictionary. I also think performance are not that bad, but there are still way of improvement :
After #45 (comment) Sadly HOPMA will be counted twice, we only get the "count" attribut for a given concept. We might be able to actually get the tools, and then count them, but it would be better (cpu, network) to do it on bio.tools side. |
Just to write it somewhere, we store the information of whether a node is duplicated or not in |
Ohh, well-spotted! and definitely makes things harder :(
Is this a common occurrence? (a node being duplicated). If not and since @matuskalas meant another duplication, it may not be a priority to optimize that.
Definitely, it may be easy code-wise but a huge bump in the time complexity. But if the accuracy is currently vital for your users, then an upto indication may be necessary after all. |
Hi, I just re-read the labels, we never say that there is 58 tools associated with Splicing analysis for example, we say that it is used 58 times, and We could clarify and indicate that there is |
@matuskalas as you were the one spotting that the count is |
Thanks to @matuskalas in #20 (comment) who spot that the count of https://edamontology.github.io/edam-browser/#operation_3198 is wrong for bio.tools.
I think it might be related to the fact that the term is duplicated, and thus counted twice, but when looking at https://edamontology.github.io/edam-browser/#topic_0084 the descendant count make no sens at all for now.
The text was updated successfully, but these errors were encountered: