gLM-collection

Overview

Deciphering how DNA determines an organism's development, phenotype, genetic traits, and disease predisposition remains a significant challenge, and critical applications in human genetics depend on better solutions. Motivated by the recent release of the Logan dataset by Chikhi et al. (50 petabases of preassembled yet unlabeled biological sequences across hundreds of thousands of species) and by the success of large language models (LLMs) on human language, we aim to train genomic language models (gLMs) that implicitly capture biological functional elements and their organization.

Impact

This work aims to produce the first large-scale, publicly available gLM trained on over 50 petabases of data from all sequenced organisms, capturing the full diversity of the DNA language and enhancing our understanding of genetic mechanisms.

Upcoming

The code implementation and benchmarks are forthcoming. Stay tuned for updates as we release the codebase and performance benchmarks for community use.

Contributions and Contact

We welcome contributions and collaborations! Open an issue or submit a pull request to get involved.

Authors

Kalin Nonchev* - Email | LinkedIn
Manuel Burger* - Email | LinkedIn
Andre Kahles
Gunnar Rätsch

*Equal contribution
