This repository has been archived by the owner on Dec 17, 2021. It is now read-only.

Workflows for performing genotype QC before and after imputation

eQTL-Catalogue/genotype_qc

Genotype Quality Control

This repository contains three workflows for performing genotype data QC for the eQTL Catalogue project.

Parts of this workflow have been merged into the eQTL-Catalogue/geimpute workflow. This workflow is no longer maintained independently.

Dependencies

Most of the software dependencies for the pipelines are listed in the conda environment file. A Docker container with all of these dependencies can be obtained from DockerHub.

The pipelines also require GenotypeHarmonizer and LDAK5, which need to be downloaded separately. A script for downloading them can be found here.

1. Pre-imputation QC (pre-imputation.nf)

This workflow prepares genotype data for imputation to the 1000 Genomes Phase 3 reference panel with the Michigan Imputation Server. We run a locally installed instance of the imputation server.

QC steps:

  • Align raw genotypes to the reference panel with Genotype Harmonizer.
  • Convert the genotypes to the VCF format with PLINK.
  • Exclude variants with Hardy-Weinberg p-value < 1e-6, missingness > 0.05 or minor allele frequency < 0.01 using bcftools.
  • Calculate individual-level missingness using vcftools.
  • Create separate VCF files for each chromosome.
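
The variant filtering, individual missingness and per-chromosome steps above could be reproduced by hand roughly as follows. This is a minimal sketch, not the code from the workflow itself. The function names and file arguments are hypothetical, HWE p-values are assumed to be added as an INFO tag by the bcftools fill-tags plugin, variants are assumed to be biallelic, and chromosomes are assumed to be named 1 to 22.

```shell
# Build a bcftools exclusion expression from the three QC thresholds.
# Defaults mirror the thresholds listed above. F_MISSING and MAF are
# computed by bcftools on the fly; INFO/HWE must be present in the VCF.
qc_exclude_expr() {
  printf 'INFO/HWE<%s || F_MISSING>%s || MAF<%s' \
    "${1:-1e-6}" "${2:-0.05}" "${3:-0.01}"
}

# Hypothetical helper: annotate HWE p-values, drop failing variants,
# then index the result so it can be queried by region.
# $1: input VCF (bgzipped), $2: output VCF (bgzipped)
filter_variants() {
  bcftools +fill-tags "$1" -Ou -- -t HWE |
    bcftools view -e "$(qc_exclude_expr)" -Oz -o "$2" - &&
    bcftools index -t "$2"
}

# Hypothetical helper: per-individual missingness, written to "$2.imiss".
# $1: input VCF (bgzipped), $2: output prefix
individual_missingness() {
  vcftools --gzvcf "$1" --missing-indv --out "$2"
}

# Hypothetical helper: write one VCF per autosome.
# Assumes an indexed input and contigs named 1..22 (no "chr" prefix).
# $1: input VCF (bgzipped, indexed), $2: output prefix
split_by_chromosome() {
  for chr in $(seq 1 22); do
    bcftools view -r "$chr" -Oz -o "$2_chr${chr}.vcf.gz" "$1"
  done
}
```

For example, `filter_variants raw.vcf.gz filtered.vcf.gz` followed by `split_by_chromosome filtered.vcf.gz filtered`, where both file names are placeholders.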

Execution:

nextflow run pre-imputation_qc.nf -profile eqtl_catalogue -resume \
 --bfile /gpfs/hpc/projects/genomic_references/CEDAR/genotypes/PLINK_100718_1018/CEDAR \
 --output_name CEDAR_GRCh37_genotyped \
 --outdir CEDAR

2. Convert imputed genotypes to GRCh38 coordinates (crossmap.nf)

3. Project individuals to 1000 Genomes Project reference populations (pop_assign.nf)

Input

Genotype data imputed to 1000 Genomes Phase 3 reference panel.

Analysis steps

  • Perform LD pruning on the reference dataset with PLINK.
  • Perform PCA and project new samples to the reference principal components with LDAK.

Execution:

nextflow run pop_assign.nf -profile pop_assign --vcf <path_to_vcf.vcf.gz> --data_name <study_name>
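
The LD pruning step listed in the analysis steps can be sketched with PLINK 1.9 as below. The helper name, the file prefixes and the window settings (50 variants, step 5, r^2 threshold 0.2) are illustrative assumptions, not the values used by pop_assign.nf. Only the pruning step is covered here.

```shell
# Illustrative pruning settings: window size (variants), step size, r^2 threshold.
# These are placeholders, not the values used by pop_assign.nf.
PRUNE_WINDOW="${PRUNE_WINDOW:-50}"
PRUNE_STEP="${PRUNE_STEP:-5}"
PRUNE_R2="${PRUNE_R2:-0.2}"

# Hypothetical helper: LD-prune a reference panel stored as a PLINK
# binary fileset (.bed/.bim/.fam).
# $1: input fileset prefix, $2: output prefix
ld_prune_reference() {
  # First pass: list variants in approximate linkage equilibrium
  # (written to "$2.prune.in").
  plink --bfile "$1" \
    --indep-pairwise "$PRUNE_WINDOW" "$PRUNE_STEP" "$PRUNE_R2" \
    --out "$2" &&
  # Second pass: keep only those variants in a new fileset.
  plink --bfile "$1" --extract "$2.prune.in" --make-bed --out "$2_pruned"
}
```

Calling `ld_prune_reference reference reference_ld` would then produce `reference_ld_pruned.bed`, `.bim` and `.fam`, where both prefixes are placeholders.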

Authors

The initial version of the population assignment pipeline was implemented by Katerina Peikova and Marija Samoviča and was later modified by Nurlan Kerimov and Kaur Alasoo.
