Skip to content

Latest commit

 

History

History
61 lines (48 loc) · 3.82 KB

README.md

File metadata and controls

61 lines (48 loc) · 3.82 KB

joint_variant_calling

Python package to run Join Calling on population at NGI (National Genomics Infrastructure) Sweden. The script joint_variant_calling.py implements the GATK-Workflow described in https://www.broadinstitute.org/gatk/guide/article?id=3893

With option --mixed-positions VQSR step is executed following best-practice described in http://gatkforums.broadinstitute.org/gatk/discussion/2805/howto-recalibrate-variant-quality-scores-run-vqsr in order to avoid the problem with MIXED positions (i.e., position where an indel overlaps a SNP). This is the mode used to run SweGen dataset.

Samples to be be join called can be specified in two ways:

  • in the sample field of the config.yaml file to be provided as input. Complete path to the gvcf file needs to be provided
  • a file named 00_samples.txt in the cwd. If present samples specified in this file will be used instead of those speccified in the config

If run like python joint_variant_calling.py --config config.yaml it creates the following folder structure

  • 00_intervals: optional see Intervals section
  • 00_samples.txt: samples that are processed (i.e., samples that are join called)
  • 01_CombineGVCFs: step one is CombineGVCFs, batching gvcf files together
  • 02_GenotypeGVCFs: then run GenotypeGVCFs
  • 03_CatVariants: merge the gvcfs into one in case computation has been spread out into intervals (see Intervals section)
  • 04_SelectVariants: extract SNPs and INDELs and run eveluation tool from GATK to prepare VQSR
  • 05_VariantRecalibrator: first step of VQSR
  • 06_ApplyRecalibration: second step of VQSR

If run like python joint_variant_calling.py --config config.yaml --mixed-positions it creates the following folder structure

Folders 01, 02, ..., 06 are all organised in the same way:

  • sbatch: sbatch files to be submitted to the slurm queue (Uppmax assumed)
  • std_err: output of the standard error
  • std_out: output of the standard output
  • VCF: contains the gvcf or vcf files (in general results files, in case of 05_VariantRecalibrator contains recalibration tables)

Folder 01_CombineGVCFs contains an extra sub-folder:

  • batches: used to restart joint calling if new samples are available [UNSTABLE: UNDER TEST]

If run like python joint_variant_calling.py --config config.yaml --resume it resumes the joint calling adding new samples and recomputing only the last (if needed) and the new batches (in 01_CombineGVCFs)

Intervals

The most time consuming steps of the workflow are 01_CombineGVCFs and 02_GenotypeGVCFs. These two steps can be parallised running the commands on non overlapping sections of the genome. For this purpose an utility script is provided: python utils/create_interval_list.py --dict intervals/human_g1k_v37.dict --block 1000000000 the file intervals/human_g1k_v37.dict can be found in this repo. If run in a directory (e.g. 00_intervals) it creates the intrvals files (.intervals). This folder need to be provided in the config.yaml file. Running the example command will create 4 blocks, first 3 of 1Gbp and the last one of ~200Mbp.

The example directory in this repo contains a dry-run of joint_variant_calling.py run on intervals generated by this command, on 7 samples, in batches of 4 (i.e.,this creates two batches, one of 4 and one of 3 samples)