Merge pull request #172 from dianacornejo/master
added tested indel cadd annotation
gaow authored Oct 31, 2023
2 parents 379ba00 + 6696a3e commit 738024c
Showing 1 changed file with 85 additions and 24 deletions.
109 changes: 85 additions & 24 deletions variant-annotation/vep.ipynb
@@ -102,7 +102,17 @@
"* Download the GTF file corresponding to your build, unzip it, and place it in `data/genocode`\n",
"\n",
" - [Gencode v43 / GRCh37](https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_43/GRCh37_mapping/gencode.v43lift37.annotation.gtf.gz) (62M)\n",
" - [Gencode v44 / GRCh38](https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_44/gencode.v44.annotation.gtf.gz) (47M)\n"
" - [Gencode v44 / GRCh38](https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_44/gencode.v44.annotation.gtf.gz) (47M)\n",
"\n",
"**5. Download gnomAD databases**\n",
"\n",
"* If you wish to annotate both gnomAD genome and exome frequencies, you will need to download both databases. Keep in mind that these databases are very large. For more information, please visit the [gnomAD website](https://gnomad.broadinstitute.org/downloads) \n",
"\n",
" - [gnomAD_exome_v2.1.1 / GRCh38](https://storage.googleapis.com/gcp-public-data--gnomad/release/2.1.1/liftover_grch38/vcf/exomes/gnomad.exomes.r2.1.1.sites.liftover_grch38.vcf.bgz) (86 GB)\n",
" - [gnomAD_exome_index_v2.1.1 / GRCh38](https://storage.googleapis.com/gcp-public-data--gnomad/release/2.1.1/liftover_grch38/vcf/exomes/gnomad.exomes.r2.1.1.sites.liftover_grch38.vcf.bgz.tbi) (1.3 MB)\n",
" \n",
" - [gnomAD_genome_v2.1.1 / GRCh37/hg19](https://storage.googleapis.com/gcp-public-data--gnomad/release/2.1.1/vcf/genomes/gnomad.genomes.r2.1.1.sites.vcf.bgz) (461 GB)\n",
" - [gnomAD_genome_index_v2.1.1 / GRCh37/hg19](https://storage.googleapis.com/gcp-public-data--gnomad/release/2.1.1/vcf/genomes/gnomad.genomes.r2.1.1.sites.vcf.bgz.tbi) (2.73 MB)"
]
},
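The download step above can be scripted. A minimal sketch (assuming `wget` is available and disk space has been checked — the exome pair alone is ~90 GB) that prints the commands for review rather than running them:

```shell
# Sketch: emit (not run) wget commands for the gnomAD exome files
# listed above; -c lets an interrupted transfer resume. Swap in the
# genome URLs for the 461 GB file if you need genome frequencies too.
BASE=https://storage.googleapis.com/gcp-public-data--gnomad/release/2.1.1
for f in \
    liftover_grch38/vcf/exomes/gnomad.exomes.r2.1.1.sites.liftover_grch38.vcf.bgz \
    liftover_grch38/vcf/exomes/gnomad.exomes.r2.1.1.sites.liftover_grch38.vcf.bgz.tbi
do
  echo wget -c "${BASE}/${f}"   # drop the echo to actually download
done
```

Piping the output to `sh` (after review) performs the downloads.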
{
Expand All @@ -123,13 +133,14 @@
},
"source": [
"```\n",
"sos run VEP103_38.ipynb \\\n",
"sos run vep.ipynb \\\n",
"--cwd output \\\n",
"--vcf data/test.vcf \\\n",
"--human_ancestor data/vep/human_ancestor.fa.gz \\\n",
"--conservation_file data/vep/loftee.sql \\\n",
"--gerp_bigwig data/vep/gerp_conservation_scores.homo_sapiens.GRCh38.bw \\\n",
"--cadd_snps data/cadd/whole_genome_SNVs_inclAnno.tsv.gz \\\n",
"--cad_indels data/cadd/gnomad.genomes.r3.0.indel.tsv.gz \\\n",
"--dir_cache data/vep \\\n",
"--walltime 30h \\\n",
"--mem 30G\n",
Expand All @@ -148,6 +159,30 @@
"The output file will be a tab-separated (TSV) annotation table, also provided bgzipped with a tabix index"
]
},
{
"cell_type": "markdown",
"id": "1b2d3070-b8e1-4c11-938f-8502f5ed2c6f",
"metadata": {
"kernel": "SoS"
},
"source": [
"## Important notes"
]
},
{
"cell_type": "markdown",
"id": "fe961914-34fe-4502-b5c0-8ba1bf4704e2",
"metadata": {
"kernel": "SoS"
},
"source": [
"* Please be mindful that SNVs and indels absent from the CADD SNV or indel databases will not be annotated, which could impact your downstream analysis.\n",
"\n",
"* This pipeline uses cache version 103 with VEP release 110; if this is not what you want, adjust the `cache_version` parameter accordingly.\n",
"\n",
"* If you would like to annotate against the ClinVar database, add the `clinvar_db` parameter, which on Columbia's cluster is `/mnt/vast/hpc/csg/data_public/clinvar/clinvar_20231028.vcf.gz`"
]
},
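The first note above matters in practice: variants missing from the CADD databases simply come back with an empty score. A quick sanity check on the formatted output is worth running before downstream analysis; a sketch on a toy table (hypothetical rows and an assumed `CADD_PHRED` column name, not real pipeline output):

```python
import csv
import io

# Toy annotation table: variants absent from the CADD SNV/indel
# databases get an empty CADD_PHRED field, so count them up front.
toy = "Uploaded_variation\tCADD_PHRED\nrs1\t12.3\nrs2\t\n"
rows = list(csv.DictReader(io.StringIO(toy), delimiter='\t'))
missing = [r['Uploaded_variation'] for r in rows if not r['CADD_PHRED']]
print(missing)  # ['rs2']
```

Real VEP output may mark missing values with `-` rather than an empty string; adjust the filter accordingly.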
{
"cell_type": "markdown",
"id": "f877092f-7e2b-4a6f-a907-b50062ce1177",
@@ -180,8 +215,16 @@
"parameter: conservation_file = path\n",
"# GERP bigwig\n",
"parameter: gerp_bigwig = path\n",
"# CADD databse\n",
"parameter: cadd_snps= path\n",
"# CADD database for SNVs\n",
"parameter: cadd_snps = path\n",
"# CADD database for indels\n",
"parameter: cadd_indels = path\n",
"# ClinVar database\n",
"parameter: clinvar_db = path('.')\n",
"# Cache version to use\n",
"parameter: cache_version = int\n",
"# Genome assembly to use\n",
"parameter: assembly = 'GRCh38'\n",
"# Cache dir\n",
"parameter: dir_cache = path\n",
"# For cluster jobs, number commands to run per job\n",
Expand All @@ -201,41 +244,35 @@
},
"outputs": [],
"source": [
"[default]\n",
"[default_1]\n",
"input: vcf\n",
"output: f'{cwd}/{_input:bn}.rare.VEP.gnomAD.vcf', f'{cwd}/{_input:bn}.rare.VEP.gnomAD.vcf.gz' \n",
"output: f'{cwd}/{_input:bn}.rare.VEP.CADD_gnomAD.tsv', f'{cwd}/{_input:bn}.rare.VEP.CADD_gnomAD.tsv.gz' \n",
"parameter: vep_window = 10000\n",
"parameter: most_severe = False\n",
"parameter: everything = True\n",
"bash: container=container, entrypoint=\"micromamba run -a '' -n rare_variation\", expand= \"${ }\", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout'\n",
" vep \\\n",
" --verbose \\\n",
" --vcf \\\n",
" --tab \\\n",
" -i ${_input} \\\n",
" -o ${_output[0]} \\\n",
" --distance ${vep_window} \\\n",
" --no_stats \\\n",
" --cache_version ${cache_version} \\\n",
" --assembly ${assembly} \\\n",
" --force_overwrite \\\n",
" --offline \\\n",
" ${('--most_severe') if most_severe else ''} \\\n",
" --dir_cache ${dir_cache} \\\n",
" --dir_plugins $VEP_PLUGIN_DIR \\\n",
" --everything \\\n",
" ${('--everything') if everything else ''} \\\n",
"   ${('--custom file=' + str(clinvar_db) + ',short_name=ClinVar,format=vcf,type=exact,coords=0,fields=CLNSIG%CLNREVSTAT%CLNDN') if str(clinvar_db) != '.' else ''} \\\n",
" --plugin LoF,human_ancestor_fa:${human_ancestor},loftee_path:$VEP_PLUGIN_DIR,conservation_file:${conservation_file},gerp_bigwig:${gerp_bigwig} \\\n",
" --plugin CADD,snv=${cadd_snps} \n",
" --plugin CADD,snv=${cadd_snps},indels=${cadd_indels}\n",
" bgzip --keep ${_output[0]}\n",
" tabix ${_output[1]}"
]
},
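The `${(...) if ... else ''}` pieces in the command above are ordinary Python conditional expressions that SoS evaluates under `expand="${ }"`: a disabled flag interpolates to an empty string and drops out of the command line. A standalone sketch of the idea, outside SoS:

```python
# Mimic how the optional VEP flags above are assembled: each parameter
# contributes its flag only when enabled; empty strings are filtered out.
most_severe = False
everything = True
flags = [
    '--most_severe' if most_severe else '',
    '--everything' if everything else '',
]
cmd = ' '.join(['vep'] + [f for f in flags if f])
print(cmd)  # vep --everything
```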
{
"cell_type": "code",
"execution_count": null,
"id": "d8b7cf9d-d8ff-47a2-8784-3949c943d1e8",
"metadata": {
"kernel": "SoS"
},
"outputs": [],
"source": [
"## FIXME: Still not able to get indel CADD annotation working "
]
},
{
"cell_type": "markdown",
"id": "728ff640-3949-4d33-bedd-58da204fc0cc",
Expand All @@ -247,13 +284,37 @@
]
},
{
"cell_type": "markdown",
"id": "8f699756-20d9-42da-a63c-5cdeca56ae46",
"cell_type": "code",
"execution_count": null,
"id": "8e1df42c-302b-4d1b-be2b-0c5d9072e99a",
"metadata": {
"kernel": "SoS"
},
"outputs": [],
"source": [
"- convert this output into a format similar to that of annovar (csv and with column headers)"
"[default_2]\n",
"input: f'{cwd}/{vcf:bn}.rare.VEP.CADD_gnomAD.tsv.gz'\n",
"output: f'{cwd}/{_input:bn}.formatted_anno.tsv'\n",
"python: expand= \"${ }\", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout'\n",
" import pandas as pd\n",
" import gzip\n",
" # Initialize an empty list to store the comments\n",
" comments = []\n",
"\n",
" # Open the gzip-compressed VCF file and read it line by line\n",
" with gzip.open('${_input}', 'rt') as file:\n",
" for line in file:\n",
" if line.startswith('#'):\n",
" comments.append(line.strip())\n",
"\n",
" # The last item in the 'comments' list is the header comment\n",
" header_comment = comments[-1]\n",
" # Extract the header from the comment\n",
" header = header_comment.split('#')[1]\n",
" # Split the header into column names\n",
" columns = header.split('\\t')\n",
" df = pd.read_csv('${_input}', comment='#', sep='\\t', names=columns)\n",
" df.to_csv('${_output}', index=False, sep='\\t', header=True)"
]
},
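The header-recovery logic in `default_2` can be exercised on a toy, uncompressed VEP-style table (a sketch with made-up column names; the pipeline reads the bgzipped file instead):

```python
import csv
import io

# VEP tab output: '##' metadata lines, then one '#'-prefixed header
# line, then tab-separated data rows.
toy = "## VEP run\n#Uploaded_variation\tLocation\tCADD_PHRED\nrs1\t1:100\t12.3\n"
comments = [l.rstrip('\n') for l in io.StringIO(toy) if l.startswith('#')]
columns = comments[-1].lstrip('#').split('\t')   # last comment = header
rows = [r for r in csv.reader(io.StringIO(toy), delimiter='\t')
        if not r[0].startswith('#')]
print(columns)  # ['Uploaded_variation', 'Location', 'CADD_PHRED']
print(rows)     # [['rs1', '1:100', '12.3']]
```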
{
