Questions about inconsistencies between the paper and the released data #3

Open
LittletreeZou opened this issue Feb 29, 2024 · 1 comment

Comments

@LittletreeZou

Thank you for integrating and open-sourcing the benchmark dataset.
I noticed some inconsistencies between the statistics in the paper and the released data in benchmarks/CodonBERT/data. These are the parts I find confusing:

  • For the MLOS flu vaccine data, Table 1 of the paper lists 543 mRNA samples, but I only found 167 samples in the released data.
  • For the SARS-CoV-2 vaccine degradation data, Table 1 of the paper lists 2400 mRNA samples, but I only found 233 samples in the released data.

Could you kindly clarify them?

As a side note, some of the datasets are very small. With a 0.7/0.15/0.15 split on such a small dataset, metrics like correlation are computed on a very small test set, so the results are not reliable. It would be better to use k-fold cross-validation.
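
To make the suggestion concrete, here is a minimal sketch of k-fold evaluation in plain Python. Everything in it is an assumption for illustration: the one-feature least-squares line stands in for whatever model is actually benchmarked, the data are synthetic, and the sample size of 167 is simply the number of MLOS samples found in the released data. With 167 samples, a 0.15 test split holds out about 25 samples, whereas pooling out-of-fold predictions lets every sample contribute to the correlation.

```python
import random
from statistics import mean


def kfold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and deal the indices into k near-equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5


def fit_line(xs, ys):
    """Stand-in model (an assumption): one-feature least-squares line."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return lambda x: my + slope * (x - mx)


def cross_val_correlation(xs, ys, k=5, seed=0):
    """Correlation between targets and pooled out-of-fold predictions."""
    preds = [None] * len(xs)
    for fold in kfold_indices(len(xs), k, seed):
        held_out = set(fold)
        train = [i for i in range(len(xs)) if i not in held_out]
        model = fit_line([xs[i] for i in train], [ys[i] for i in train])
        # Each sample is predicted by a model that never saw it.
        for i in fold:
            preds[i] = model(xs[i])
    return pearson(preds, ys)


if __name__ == "__main__":
    rng = random.Random(1)
    xs = [rng.uniform(0, 1) for _ in range(167)]  # synthetic feature
    ys = [2 * x + rng.gauss(0, 0.5) for x in xs]  # synthetic target
    print(cross_val_correlation(xs, ys, k=5))
```

Repeating the call with several `seed` values gives a rough sense of how much the metric moves with the choice of split.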

@phil-fradkin

To follow up on this, maybe we can restrict the scope of the question to whether the released datasets are consistent with the paper.

After downloading the data, I found the following sample counts (downloaded - reported):

MLOS: 167 - 543
TC Riboswitches: 355 - 355
CoV Vaccine: 2,400 - 2,400
mRFP Expression: 1,459 - 1,459
Fungal Expression: 7,089 - 7,056
E. coli Proteins: 6,348 - 6,348
mRNA Stability: 65,356 - 41,123

It would be helpful if the authors clarified the sample-count discrepancies for the mRNA Stability, Fungal Expression, and MLOS datasets.
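
For anyone who wants to reproduce the comparison, here is a small sketch that tabulates the gaps from the counts listed above. The `count_rows` helper is hypothetical: it assumes each dataset is a single CSV file with one header row, which may not match the actual layout of the released data.

```python
import csv

# (downloaded, reported) sample counts as listed in this comment.
COUNTS = {
    "MLOS": (167, 543),
    "TC Riboswitches": (355, 355),
    "CoV Vaccine": (2400, 2400),
    "mRFP Expression": (1459, 1459),
    "Fungal Expression": (7089, 7056),
    "E. coli Proteins": (6348, 6348),
    "mRNA Stability": (65356, 41123),
}


def count_rows(csv_path):
    """Count data rows in a CSV file, assuming a single header row."""
    with open(csv_path, newline="") as fh:
        return max(sum(1 for _ in csv.reader(fh)) - 1, 0)


def mismatches(counts):
    """Return {name: downloaded - reported} for datasets whose counts differ."""
    return {name: d - r for name, (d, r) in counts.items() if d != r}


if __name__ == "__main__":
    for name, diff in mismatches(COUNTS).items():
        print(f"{name}: {diff:+d}")
```

By these numbers, MLOS is 376 samples short of the reported count, while Fungal Expression and mRNA Stability have 33 and 24,233 more samples than reported.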

Thanks a lot and congrats on the publication!
