Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GATK BaseRecalibrator failing due to issues with known_variants.vcf file #15

Open
Matthew-A-epi opened this issue Sep 10, 2024 · 0 comments

Comments

@Matthew-A-epi
Copy link
Collaborator

Matthew-A-epi commented Sep 10, 2024

On Rosalind, I'm encountering an error when running the snp_call_nf pipeline, specifically at the GATK BaseRecalibrator step. The pipeline is failing to find the expected known_variants.vcf file and when using an alternative file (known_snps_unsorted.vcf), it's failing validation.

Error Message:

A USER ERROR has occurred: Couldn't read file file:///local/data/Malaria/Projects/Takala-Harrison/Temporal_Malawi_Matt/snp_call_pipe/snp_call_nf/ref/pf_crosses_v1/known_variants.vcf. Error was: It doesn't exist.

Troubleshooting Steps and Results:

  1. Checked reference directory:
    The expected directory (/local/data/Malaria/Projects/Takala-Harrison/Temporal_Malawi_Matt/snp_call_pipe/snp_call_nf/ref/pf_crosses_v1/) exists, but doesn't contain a file named known_variants.vcf. I'm not sure why it is pointing to this directory as I set the link to the ref folder like this:
ln -s /local/data/Malaria/Projects/Takala-Harrison/Cambodia_Bing/ref/* ref/
  1. Investigated (suspected) alternative reference file:

/local/data/Malaria/Projects/Takala-Harrison/Cambodia_Bing/ref/known_snps_unsorted.vcf (and vcf.idx)

  1. Checked symbolic links:
    Attempted to recreate symbolic links to reference files, but they already existed.

  2. Attempted to validate alternative VCF file:
    Ran GATK ValidateVariants on known_snps_unsorted.vcf which resulted in the error below the command here:

gatk ValidateVariants -V ref/known_snps_unsorted.vcf -R ref/PlasmoDB-44_Pfalciparum3D7_Genome.fasta

A USER ERROR has occurred: Input ref/known_snps_unsorted.vcf fails strict validation of type ALL: one or more of the ALT allele(s) for the record at position Pf3D7_04_v3:37119 are not observed at all in the sample genotypes

Questions:

  1. From this comment in the nextflow.config file, it looks like the 'known_variants.vcf' file is supposed to be used in the stead of the 'known_snps_unsorted.vcf' file.
    // Original known_snps_unsorted.vcf has 944,270 SNPs (from previous lab members).
    // The updated file is pf_crosses_v1/known_variants.vcf which has 66,121 variants (snp/indels)
    // The later is generated following the MalariaGen Pf6 paper: 
    //      MalariaGEN, et al. (2021). An open dataset of Plasmodium falciparum
    //      genome variation in 7,000 worldwide samples. Wellcome Open Res 6, 42.
    //      10.12688/wellcomeopenres.16168.2.

    known_sites = ["$projectDir/ref/pf_crosses_v1/known_variants.vcf"]

Is it correct to use the 'known_variant.vcf'? Even though it only has 66,121 variants?

If it is desirable to use the 'known_variants.vcf' file, is this is the correct file:

./AFRIMS/snp_call_nf/ref/pf_crosses_v1/known_variants.vcf
  1. If the 'known_snps_unsorted.vcf' is the correct file, is there an issue with the error that GATK ValidateVariants produced?

  2. The pipeline seems to be looking for the file in a different location than where the reference files are symlinked. How can I correct this path issue in the Nextflow configuration? Maybe adding a param for reference directory that is different from projectDir like so:

Edit params as follows:

  params {
      // Add this line near the top of the params section
      ref_dir = "$projectDir/ref"
  
      // Then update all the paths to use this new parameter
      parasite {
          fasta = "${params.ref_dir}/PlasmoDB-44_Pfalciparum3D7_Genome.fasta"
          fasta_prefix = "${params.ref_dir}/PlasmoDB-44_Pfalciparum3D7_Genome"
      }
      host {
          fasta = ["${params.ref_dir}/host/hg38.fasta"]
          fasta_prefix = ["${params.ref_dir}/host/hg38"]
      }
      known_sites = ["${params.ref_dir}/pf_crosses_v1/known_variants.vcf"]

Also, update VQSR resources as well:

      // vqsr is off by default, because the test data is too small and will cause error/crash
      vqsr = false
      vqsr_resources = [ 
          [name: 'jacob2014_microarray_liftover', type: 'truth', prior: 15, vcf: "${params.ref_dir}/jacob2014_chip_sites.vcf" ], 
          [name: 'Pfcross1_3d7_hb3_gatk_pass', type: 'training', prior: 12, vcf: "${params.ref_dir}/pf_crosses_v1/pass_3d7_hb3.gatk.final.vcf.gz" ], 
          [name: 'Pfcross1_7g8_gb4_gatk_pass', type: 'training', prior: 12, vcf: "${params.ref_dir}/pf_crosses_v1/pass_7g8_gb4.gatk.final.vcf.gz" ], 
          [name: 'Pfcross1_hb3_dd2_gatk_pass', type: 'training', prior: 12, vcf: "${params.ref_dir}/pf_crosses_v1/pass_hb3_dd2.gatk.final.vcf.gz" ]
      ]
  }

Command to run pipeline for internal (Rosalind) users:

nextflow run main.nf --ref_dir /local/data/Malaria/Projects/Takala-Harrison/Cambodia_Bing/ref

If this is acceptable, here is the README update:

Reference Directory:
The pipeline uses a reference directory for various genome files and known variants.
By default, this is set to $projectDir/ref, but can be changed using the --ref_dir parameter:

nextflow run main.nf --ref_dir /path/to/your/reference/directory

Internal users should use:

nextflow run main.nf --ref_dir /local/data/Malaria/Projects/Takala-Harrison/Cambodia_Bing/ref

External users who have set up the reference directory in the default location can run the pipeline without this parameter.

Environment:

  • Nextflow version: version 22.10.6 build 5843

  • GATK version:

Using GATK jar /home/matt.adams/miniconda3/envs/snp_call_nf/share/gatk4-4.2.2.0-1/gatk-package-4.2.2.0-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /home/matt.adams/miniconda3/envs/snp_call_nf/share/gatk4-4.2.2.0-1/gatk-package-4.2.2.0-local.jar --version
The Genome Analysis Toolkit (GATK) v4.2.2.0
HTSJDK Version: 2.24.1
Picard Version: 2.25.4

  • Reference genome: PlasmoDB-44_Pfalciparum3D7_Genome

Any guidance on resolving these issues would be greatly appreciated. Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant