Assets for human germline pipeline #21

Open
apraga opened this issue Jun 1, 2024 · 6 comments

Comments

@apraga

apraga commented Jun 1, 2024

Hi,

I've put the data needed to run germline analysis for Homo sapiens into a separate assets repository here: https://github.com/apraga/human_genome_assets https://github.com/apraga/germline-analysis-vep
Do you think it could be added to the scidataflow assets?

At the moment, there's only the latest version of a reference genome (pipeline-ready GRCh38) and databases for annotation (dbSNP, CADD score and VEP cache).
Each dataset is in a separate directory, with subdirectories specifying the genome version and the database version. This info is still in the filename for reference. I've taken the liberty of renaming files to make it more user-friendly.
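
To make the layout described above concrete, here is a minimal sketch of how a path could be assembled. The nesting order (dataset, then genome version, then database version), the helper name `asset_path`, and every value passed to it are illustrative assumptions, not the actual contents of the repository.

```python
from pathlib import PurePosixPath


def asset_path(dataset, genome_version, db_version, filename):
    """Hypothetical helper: build dataset/genome_version/db_version/filename."""
    # Assumed nesting order; the real repository may order the levels differently.
    return PurePosixPath(dataset) / genome_version / db_version / filename


# Placeholder values only; they are not real file names from the repository.
print(asset_path("dbsnp", "GRCh38", "v1", "dbsnp_GRCh38_v1.vcf.gz"))
```

The version information appears both in the directories and in the file name, which matches the redundancy mentioned above.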

I plan to update this repository frequently but am open to discussion about its structure.

Note: if #13 can be solved, that would make it easier to work with several versions.

Thanks !

@apraga
Author

apraga commented Aug 19, 2024

@vsbuffalo Pinging you in case you have some time to look at it. Thanks :)

@vsbuffalo
Owner

Hey Alexis,

I am definitely open to hosting this. I would suggest a more specific name, however (as human_genome_assets is fairly general).

I am still thinking how best to handle #13 too. The tricky thing is the separation of a remote manifest and the information about what the user wants locally. Either the remote manifest could have an additional line like local that is a boolean or there could be another file that stores this information (which is a bit messy).

@apraga
Author

apraga commented Aug 20, 2024 via email

@vsbuffalo
Owner

Hey Alexis,

It still needs to be a more specific name — "human-genome-annotation" is too general, since your asset has specific annotation such as CADD. How about hg38-germline-analysis-cadd?

@apraga
Author

apraga commented Sep 1, 2024

Hi,
Rewriting my previous comment after thinking about it further.

Do you think assets should be named according to a set of rules?
A rather loose format could be $project-$software(-$feature), all lowercase, with parts separated by a '-'.
For example, https://github.com/scidataflow-assets/nygc_gatk_1000G_highcov would be nygc-1000G-gatk-highcov with project=nygc-1000G, software=gatk and feature=highcov.
Please feel free to move that suggestion elsewhere if needed.
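
As a rough illustration of the proposed format, the sketch below builds a name from its parts. The function `build_asset_name` and its validation pattern are hypothetical and not part of SciDataFlow. It also applies the "all lowercase" rule strictly, so `1000G` becomes `1000g`, which differs slightly from the example above.

```python
import re

# Hypothetical rule: each part is lowercase alphanumerics, optionally hyphenated.
_PART = re.compile(r"^[a-z0-9]+(-[a-z0-9]+)*$")


def build_asset_name(project, software, feature=None):
    """Join the parts as $project-$software(-$feature), lowercased."""
    parts = [project, software] + ([feature] if feature else [])
    parts = [p.strip().lower() for p in parts]
    for p in parts:
        if not _PART.match(p):
            raise ValueError(f"invalid name component: {p!r}")
    return "-".join(parts)


print(build_asset_name("nygc-1000G", "gatk", "highcov"))  # nygc-1000g-gatk-highcov
# Assumed split for this asset: project=germline-analysis, software=vep.
print(build_asset_name("germline-analysis", "vep"))       # germline-analysis-vep
```

Because a project may itself contain a '-', a name cannot be split back into its parts unambiguously, so the sketch only builds names and does not parse them.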

With that format, I propose that the asset for this PR be renamed germline-analysis-vep. It's rather short, and the initial goal is to offer a genome and databases for annotating with VEP. The README has been updated accordingly.
This asset aims to offer multiple genome versions, so 'hg38' in the name is not appropriate (T2T is on my TODO list).

What do you think ?

@apraga changed the title from "Assets for human germiline pipeline" to "Assets for human germline pipeline" Sep 17, 2024
@apraga
Author

apraga commented Oct 10, 2024

Hi @vsbuffalo, to focus on this PR, are you okay with the new asset name ?
Thanks,

2 participants