GATK BaseRecalibrator failing due to issues with known_variants.vcf file #15

Matthew-A-epi · 2024-09-10T14:54:05Z

On Rosalind, I'm encountering an error when running the snp_call_nf pipeline, specifically at the GATK BaseRecalibrator step. The pipeline is failing to find the expected known_variants.vcf file and when using an alternative file (known_snps_unsorted.vcf), it's failing validation.

Error Message:

A USER ERROR has occurred: Couldn't read file file:///local/data/Malaria/Projects/Takala-Harrison/Temporal_Malawi_Matt/snp_call_pipe/snp_call_nf/ref/pf_crosses_v1/known_variants.vcf. Error was: It doesn't exist.

Troubleshooting Steps and Results:

Checked reference directory:
The expected directory (/local/data/Malaria/Projects/Takala-Harrison/Temporal_Malawi_Matt/snp_call_pipe/snp_call_nf/ref/pf_crosses_v1/) exists, but doesn't contain a file named known_variants.vcf. I'm not sure why it is pointing to this directory as I set the link to the ref folder like this:

ln -s /local/data/Malaria/Projects/Takala-Harrison/Cambodia_Bing/ref/* ref/

Investigated (suspected) alternative reference file:

/local/data/Malaria/Projects/Takala-Harrison/Cambodia_Bing/ref/known_snps_unsorted.vcf (and vcf.idx)

Checked symbolic links:
Attempted to recreate symbolic links to reference files, but they already existed.
Attempted to validate alternative VCF file:
Ran GATK ValidateVariants on known_snps_unsorted.vcf which resulted in the error below the command here:

gatk ValidateVariants -V ref/known_snps_unsorted.vcf -R ref/PlasmoDB-44_Pfalciparum3D7_Genome.fasta

A USER ERROR has occurred: Input ref/known_snps_unsorted.vcf fails strict validation of type ALL: one or more of the ALT allele(s) for the record at position Pf3D7_04_v3:37119 are not observed at all in the sample genotypes

Questions:

From this comment in the nextflow.config file, it looks like the 'known_variants.vcf' file is supposed to be used in the stead of the 'known_snps_unsorted.vcf' file.

    // Original known_snps_unsorted.vcf has 944,270 SNPs (from previous lab members).
    // The updated file is pf_crosses_v1/known_variants.vcf which has 66,121 variants (snp/indels)
    // The later is generated following the MalariaGen Pf6 paper: 
    //      MalariaGEN, et al. (2021). An open dataset of Plasmodium falciparum
    //      genome variation in 7,000 worldwide samples. Wellcome Open Res 6, 42.
    //      10.12688/wellcomeopenres.16168.2.

    known_sites = ["$projectDir/ref/pf_crosses_v1/known_variants.vcf"]

Is it correct to use the 'known_variant.vcf'? Even though it only has 66,121 variants?

If it is desirable to use the 'known_variants.vcf' file, is this is the correct file:

./AFRIMS/snp_call_nf/ref/pf_crosses_v1/known_variants.vcf

If the 'known_snps_unsorted.vcf' is the correct file, is there an issue with the error that GATK ValidateVariants produced?
The pipeline seems to be looking for the file in a different location than where the reference files are symlinked. How can I correct this path issue in the Nextflow configuration? Maybe adding a param for reference directory that is different from projectDir like so:

Edit params as follows:

  params {
      // Add this line near the top of the params section
      ref_dir = "$projectDir/ref"
  
      // Then update all the paths to use this new parameter
      parasite {
          fasta = "${params.ref_dir}/PlasmoDB-44_Pfalciparum3D7_Genome.fasta"
          fasta_prefix = "${params.ref_dir}/PlasmoDB-44_Pfalciparum3D7_Genome"
      }
      host {
          fasta = ["${params.ref_dir}/host/hg38.fasta"]
          fasta_prefix = ["${params.ref_dir}/host/hg38"]
      }
      known_sites = ["${params.ref_dir}/pf_crosses_v1/known_variants.vcf"]

Also, update VQSR resources as well:

      // vqsr is off by default, because the test data is too small and will cause error/crash
      vqsr = false
      vqsr_resources = [ 
          [name: 'jacob2014_microarray_liftover', type: 'truth', prior: 15, vcf: "${params.ref_dir}/jacob2014_chip_sites.vcf" ], 
          [name: 'Pfcross1_3d7_hb3_gatk_pass', type: 'training', prior: 12, vcf: "${params.ref_dir}/pf_crosses_v1/pass_3d7_hb3.gatk.final.vcf.gz" ], 
          [name: 'Pfcross1_7g8_gb4_gatk_pass', type: 'training', prior: 12, vcf: "${params.ref_dir}/pf_crosses_v1/pass_7g8_gb4.gatk.final.vcf.gz" ], 
          [name: 'Pfcross1_hb3_dd2_gatk_pass', type: 'training', prior: 12, vcf: "${params.ref_dir}/pf_crosses_v1/pass_hb3_dd2.gatk.final.vcf.gz" ]
      ]
  }

Command to run pipeline for internal (Rosalind) users:

nextflow run main.nf --ref_dir /local/data/Malaria/Projects/Takala-Harrison/Cambodia_Bing/ref

If this is acceptable, here is the README update:

Reference Directory:
The pipeline uses a reference directory for various genome files and known variants.
By default, this is set to $projectDir/ref, but can be changed using the --ref_dir parameter:

nextflow run main.nf --ref_dir /path/to/your/reference/directory

Internal users should use:

nextflow run main.nf --ref_dir /local/data/Malaria/Projects/Takala-Harrison/Cambodia_Bing/ref

External users who have set up the reference directory in the default location can run the pipeline without this parameter.

Environment:

Nextflow version: version 22.10.6 build 5843
GATK version:

Using GATK jar /home/matt.adams/miniconda3/envs/snp_call_nf/share/gatk4-4.2.2.0-1/gatk-package-4.2.2.0-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /home/matt.adams/miniconda3/envs/snp_call_nf/share/gatk4-4.2.2.0-1/gatk-package-4.2.2.0-local.jar --version
The Genome Analysis Toolkit (GATK) v4.2.2.0
HTSJDK Version: 2.24.1
Picard Version: 2.25.4

Reference genome: PlasmoDB-44_Pfalciparum3D7_Genome

Any guidance on resolving these issues would be greatly appreciated. Thank you!

The text was updated successfully, but these errors were encountered:

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

GATK BaseRecalibrator failing due to issues with known_variants.vcf file #15

GATK BaseRecalibrator failing due to issues with known_variants.vcf file #15

Matthew-A-epi commented Sep 10, 2024 •

edited

Loading

GATK BaseRecalibrator failing due to issues with known_variants.vcf file #15

GATK BaseRecalibrator failing due to issues with known_variants.vcf file #15

Comments

Matthew-A-epi commented Sep 10, 2024 • edited Loading

Matthew-A-epi commented Sep 10, 2024 •

edited

Loading