ratschlab/gLM-collection

gLM-collection

Overview

Deciphering how DNA determines an organism's development, phenotype, genetic traits, and disease predisposition remains a significant challenge, and critical applications in human genetics depend on better solutions. Motivated by the recent release of the Logan dataset by Chikhi et al. (50 petabases of preassembled yet unlabeled biological sequences across hundreds of thousands of species) and by the success of large language models (LLMs) on human language, we aim to train genomic language models (gLMs) that implicitly capture biological functional elements and their organization.

Impact

This work aims to produce the first large-scale, publicly available gLM trained on over 50 petabases of data from all sequenced organisms, capturing the full diversity of the DNA language and enhancing our understanding of genetic mechanisms.

Upcoming

  • Code and benchmarks are forthcoming. Stay tuned as we release the codebase and performance benchmarks for community use.

Contributions and Contact

We welcome contributions and collaborations! Open an issue or a pull request to get involved.

Authors

*Equal contribution

About

Collection of genomic Language Models
