Deciphering how DNA determines an organism's development, phenotype, genetic traits, and disease predisposition remains a central challenge, and critical applications in human genetics depend on better solutions. Motivated by the recent release of the Logan dataset by Chikhi et al. (50 petabases of preassembled yet unlabeled biological sequences across hundreds of thousands of species) and the success of large language models (LLMs) on human language, we aim to train genomic language models (gLMs) that implicitly capture biological functional elements and their organization.
This work aims to produce the first large-scale, publicly available gLM trained on over 50 petabases of data from all sequenced organisms, capturing the full diversity of the DNA language and enhancing our understanding of genetic mechanisms.
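Since the codebase is not yet public (see below), the following is a minimal, hypothetical PyTorch sketch of the BERT-style masked-language-model pretraining commonly used for gLMs. The single-nucleotide vocabulary, the `TinyDnaLM` architecture, and all hyperparameters are illustrative assumptions, not this project's actual design.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical placeholder code: the project's codebase is not yet released,
# so the vocabulary, architecture, and hyperparameters below are illustrative
# assumptions, not the actual implementation.

VOCAB = {"A": 0, "C": 1, "G": 2, "T": 3, "[MASK]": 4}

def tokenize(seq: str) -> torch.Tensor:
    """Map a DNA string to integer token ids (single-nucleotide vocabulary)."""
    return torch.tensor([VOCAB[base] for base in seq.upper()])

def mask_tokens(ids: torch.Tensor, mask_prob: float = 0.15):
    """BERT-style masking: hide a random subset of positions for the model to predict."""
    labels = ids.clone()
    mask = torch.rand(ids.shape) < mask_prob
    mask[0] = True                        # ensure at least one masked position in this tiny demo
    labels[~mask] = -100                  # loss is computed on masked positions only
    masked_ids = ids.clone()
    masked_ids[mask] = VOCAB["[MASK]"]
    return masked_ids, labels

class TinyDnaLM(nn.Module):
    """A deliberately tiny transformer encoder with a per-token LM head."""
    def __init__(self, vocab_size=len(VOCAB), d_model=64, n_heads=4,
                 n_layers=2, max_len=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, ids):
        positions = torch.arange(ids.size(1), device=ids.device)
        x = self.embed(ids) + self.pos(positions)
        return self.head(self.encoder(x))  # (batch, seq_len, vocab)

# One illustrative pretraining step on a toy sequence.
model = TinyDnaLM()
ids, labels = mask_tokens(tokenize("ACGTACGTGGCATTACGGATC"))
logits = model(ids.unsqueeze(0))
loss = nn.functional.cross_entropy(
    logits.view(-1, len(VOCAB)), labels.view(-1), ignore_index=-100
)
loss.backward()
print(f"masked-LM loss: {loss.item():.3f}")
```

Production-scale gLMs typically follow the same recipe but scale it to billions of parameters and tokenize with k-mers or byte-pair encodings rather than single nucleotides.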
- Code and benchmarks are coming soon: stay tuned as we release the codebase and performance benchmarks for community use.
We welcome contributions and collaborations! Open an issue or submit a pull request to get involved.
*Equal contribution