Merge pull request #172 from dianacornejo/master
added tested indel cadd annotation
gaow authored Oct 31, 2023
2 parents 379ba00 + 6696a3e commit 738024c
Showing 1 changed file with 85 additions and 24 deletions.
109 changes: 85 additions & 24 deletions variant-annotation/vep.ipynb
@@ -102,7 +102,17 @@
"* Download the GTF file corresponding to your build, unzip it, and place it in `data/genocode`\n",
"\n",
" - [Gencode v43 / GRCh37](https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_43/GRCh37_mapping/gencode.v43lift37.annotation.gtf.gz) (62M)\n",
" - [Gencode v44 / GRCh38](https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_44/gencode.v44.annotation.gtf.gz) (47M)\n"
" - [Gencode v44 / GRCh38](https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_44/gencode.v44.annotation.gtf.gz) (47M)\n",
"\n",
"**5. Download gnomAD databases**\n",
"\n",
"* If you wish to annotate both gnomAD genome and exome frequencies, you will need to download both databases. Keep in mind that these databases are very large. For more information, please visit the [gnomAD website](https://gnomad.broadinstitute.org/downloads) \n",
"\n",
" - [gnomAD_exome_v2.1.1 / GRCh38](https://storage.googleapis.com/gcp-public-data--gnomad/release/2.1.1/liftover_grch38/vcf/exomes/gnomad.exomes.r2.1.1.sites.liftover_grch38.vcf.bgz) (86 GB)\n",
" - [gnomAD_exome_index_v2.1.1 / GRCh38](https://storage.googleapis.com/gcp-public-data--gnomad/release/2.1.1/liftover_grch38/vcf/exomes/gnomad.exomes.r2.1.1.sites.liftover_grch38.vcf.bgz.tbi) (1.3 MB)\n",
" \n",
" - [gnomAD_genome_v2.1.1 / GRCh37/hg19](https://storage.googleapis.com/gcp-public-data--gnomad/release/2.1.1/vcf/genomes/gnomad.genomes.r2.1.1.sites.vcf.bgz) (461 GB)\n",
" - [gnomAD_genome_index_v2.1.1 / GRCh37/hg19](https://storage.googleapis.com/gcp-public-data--gnomad/release/2.1.1/vcf/genomes/gnomad.genomes.r2.1.1.sites.vcf.bgz.tbi) (2.73 MB)"
]
},
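The download step above can be scripted. A minimal sketch (assuming `wget` is available and disk space has been checked — the exome pair alone is ~90 GB) that prints the commands for review rather than running them:

```shell
# Sketch: emit (not run) wget commands for the gnomAD exome files
# listed above; -c lets an interrupted transfer resume. Swap in the
# genome URLs for the 461 GB file if you need genome frequencies too.
BASE=https://storage.googleapis.com/gcp-public-data--gnomad/release/2.1.1
for f in \
    liftover_grch38/vcf/exomes/gnomad.exomes.r2.1.1.sites.liftover_grch38.vcf.bgz \
    liftover_grch38/vcf/exomes/gnomad.exomes.r2.1.1.sites.liftover_grch38.vcf.bgz.tbi
do
  echo wget -c "${BASE}/${f}"   # drop the echo to actually download
done
```

Piping the output to `sh` (after review) performs the downloads.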
{
Expand All @@ -123,13 +133,14 @@
},
"source": [
"```\n",
"sos run VEP103_38.ipynb \\\n",
"sos run vep.ipynb \\\n",
"--cwd output \\\n",
"--vcf data/test.vcf \\\n",
"--human_ancestor data/vep/human_ancestor.fa.gz \\\n",
"--conservation_file data/vep/loftee.sql \\\n",
"--gerp_bigwig data/vep/gerp_conservation_scores.homo_sapiens.GRCh38.bw \\\n",
"--cadd_snps data/cadd/whole_genome_SNVs_inclAnno.tsv.gz \\\n",
"--cad_indels data/cadd/gnomad.genomes.r3.0.indel.tsv.gz \\\n",
"--dir_cache data/vep \\\n",
"--walltime 30h \\\n",
"--mem 30G\n",
Expand All @@ -148,6 +159,30 @@
"The output file will be a tab-separated (TSV) annotation table, also provided bgzipped with a tabix index"
]
},
{
"cell_type": "markdown",
"id": "1b2d3070-b8e1-4c11-938f-8502f5ed2c6f",
"metadata": {
"kernel": "SoS"
},
"source": [
"## Important notes"
]
},
{
"cell_type": "markdown",
"id": "fe961914-34fe-4502-b5c0-8ba1bf4704e2",
"metadata": {
"kernel": "SoS"
},
"source": [
"* Please be mindful that SNVs and indels absent from the CADD SNV or indel databases will not be annotated, which could impact your downstream analysis.\n",
"\n",
"* This pipeline uses cache version 103 with VEP release 110; if this is not what you want, adjust the `cache_version` parameter accordingly.\n",
"\n",
"* If you would like to annotate against the ClinVar database, add the `clinvar_db` parameter, which on Columbia's cluster is `/mnt/vast/hpc/csg/data_public/clinvar/clinvar_20231028.vcf.gz`"
]
},
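The first note above matters in practice: variants missing from the CADD databases simply come back with an empty score. A quick sanity check on the formatted output is worth running before downstream analysis; a sketch on a toy table (hypothetical rows and an assumed `CADD_PHRED` column name, not real pipeline output):

```python
import csv
import io

# Toy annotation table: variants absent from the CADD SNV/indel
# databases get an empty CADD_PHRED field, so count them up front.
toy = "Uploaded_variation\tCADD_PHRED\nrs1\t12.3\nrs2\t\n"
rows = list(csv.DictReader(io.StringIO(toy), delimiter='\t'))
missing = [r['Uploaded_variation'] for r in rows if not r['CADD_PHRED']]
print(missing)  # ['rs2']
```

Real VEP output may mark missing values with `-` rather than an empty string; adjust the filter accordingly.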
{
"cell_type": "markdown",
"id": "f877092f-7e2b-4a6f-a907-b50062ce1177",
@@ -180,8 +215,16 @@
"parameter: conservation_file = path\n",
"# GERP bigwig\n",
"parameter: gerp_bigwig = path\n",
"# CADD databse\n",
"parameter: cadd_snps= path\n",
"# CADD database for SNVs\n",
"parameter: cadd_snps = path\n",
"# CADD database for indels\n",
"parameter: cadd_indels = path\n",
"# ClinVar database\n",
"parameter: clinvar_db = path('.')\n",
"# Cache version to use\n",
"parameter: cache_version = int\n",
"# Genome assembly to use\n",
"parameter: assembly = 'GRCh38'\n",
"# Cache dir\n",
"parameter: dir_cache = path\n",
"# For cluster jobs, number commands to run per job\n",
Expand All @@ -201,41 +244,35 @@
},
"outputs": [],
"source": [
"[default]\n",
"[default_1]\n",
"input: vcf\n",
"output: f'{cwd}/{_input:bn}.rare.VEP.gnomAD.vcf', f'{cwd}/{_input:bn}.rare.VEP.gnomAD.vcf.gz' \n",
"output: f'{cwd}/{_input:bn}.rare.VEP.CADD_gnomAD.tsv', f'{cwd}/{_input:bn}.rare.VEP.CADD_gnomAD.tsv.gz' \n",
"parameter: vep_window = 10000\n",
"parameter: most_severe = False\n",
"parameter: everything = True\n",
"bash: container=container, entrypoint=\"micromamba run -a '' -n rare_variation\", expand= \"${ }\", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout'\n",
" vep \\\n",
" --verbose \\\n",
" --vcf \\\n",
" --tab \\\n",
" -i ${_input} \\\n",
" -o ${_output[0]} \\\n",
" --distance ${vep_window} \\\n",
" --no_stats \\\n",
" --cache_version ${cache_version} \\\n",
" --assembly ${assembly} \\\n",
" --force_overwrite \\\n",
" --offline \\\n",
" ${('--most_severe') if most_severe else ''} \\\n",
" --dir_cache ${dir_cache} \\\n",
" --dir_plugins $VEP_PLUGIN_DIR \\\n",
" --everything \\\n",
" ${('--everything') if everything else ''} \\\n",
"   ${('--custom file=' + str(clinvar_db) + ',short_name=ClinVar,format=vcf,type=exact,coords=0,fields=CLNSIG%CLNREVSTAT%CLNDN') if str(clinvar_db) != '.' else ''} \\\n",
" --plugin LoF,human_ancestor_fa:${human_ancestor},loftee_path:$VEP_PLUGIN_DIR,conservation_file:${conservation_file},gerp_bigwig:${gerp_bigwig} \\\n",
" --plugin CADD,snv=${cadd_snps} \n",
" --plugin CADD,snv=${cadd_snps},indels=${cadd_indels}\n",
" bgzip --keep ${_output[0]}\n",
" tabix ${_output[1]}"
]
},
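The `${(...) if ... else ''}` pieces in the command above are ordinary Python conditional expressions that SoS evaluates under `expand="${ }"`: a disabled flag interpolates to an empty string and drops out of the command line. A standalone sketch of the idea, outside SoS:

```python
# Mimic how the optional VEP flags above are assembled: each parameter
# contributes its flag only when enabled; empty strings are filtered out.
most_severe = False
everything = True
flags = [
    '--most_severe' if most_severe else '',
    '--everything' if everything else '',
]
cmd = ' '.join(['vep'] + [f for f in flags if f])
print(cmd)  # vep --everything
```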
{
"cell_type": "code",
"execution_count": null,
"id": "d8b7cf9d-d8ff-47a2-8784-3949c943d1e8",
"metadata": {
"kernel": "SoS"
},
"outputs": [],
"source": [
"## FIXME: Still not able to get indel CADD annotation working "
]
},
{
"cell_type": "markdown",
"id": "728ff640-3949-4d33-bedd-58da204fc0cc",
Expand All @@ -247,13 +284,37 @@
]
},
{
"cell_type": "markdown",
"id": "8f699756-20d9-42da-a63c-5cdeca56ae46",
"cell_type": "code",
"execution_count": null,
"id": "8e1df42c-302b-4d1b-be2b-0c5d9072e99a",
"metadata": {
"kernel": "SoS"
},
"outputs": [],
"source": [
"- convert this output into a format similar to that of annovar (csv and with column headers)"
"[default_2]\n",
"input: f'{cwd}/{vcf:bn}.rare.VEP.CADD_gnomAD.tsv.gz'\n",
"output: f'{cwd}/{_input:bn}.formatted_anno.tsv'\n",
"python: expand= \"${ }\", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout'\n",
" import pandas as pd\n",
" import gzip\n",
" # Initialize an empty list to store the comments\n",
" comments = []\n",
"\n",
" # Open the gzip-compressed VCF file and read it line by line\n",
" with gzip.open('${_input}', 'rt') as file:\n",
" for line in file:\n",
" if line.startswith('#'):\n",
" comments.append(line.strip())\n",
"\n",
" # The last item in the 'comments' list is the header comment\n",
" header_comment = comments[-1]\n",
" # Extract the header from the comment\n",
" header = header_comment.split('#')[1]\n",
" # Split the header into column names\n",
" columns = header.split('\\t')\n",
" df = pd.read_csv('${_input}', comment='#', sep='\\t', names=columns)\n",
" df.to_csv('${_output}', index=False, sep='\\t', header=True)"
]
},
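The header-recovery logic in `default_2` can be exercised on a toy, uncompressed VEP-style table (a sketch with made-up column names; the pipeline reads the bgzipped file instead):

```python
import csv
import io

# VEP tab output: '##' metadata lines, then one '#'-prefixed header
# line, then tab-separated data rows.
toy = "## VEP run\n#Uploaded_variation\tLocation\tCADD_PHRED\nrs1\t1:100\t12.3\n"
comments = [l.rstrip('\n') for l in io.StringIO(toy) if l.startswith('#')]
columns = comments[-1].lstrip('#').split('\t')   # last comment = header
rows = [r for r in csv.reader(io.StringIO(toy), delimiter='\t')
        if not r[0].startswith('#')]
print(columns)  # ['Uploaded_variation', 'Location', 'CADD_PHRED']
print(rows)     # [['rs1', '1:100', '12.3']]
```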
{
