###########################################################
- Adding model interaction. Now LmTag supports 2 models:
- The linear option models imputation accuracy of a linear funcion of:
Imputation_accuracy ~ LD + MAF_tagSNP + MAF_taggedSNP + distance
. This model was used in previous versions. - The interaction option models imputation accuracy of a linear funcion with interaction term of LD and MAF_taggedSNP:
Imputation_accuracy ~ LD + MAF_tagSNP + MAF_taggedSNP + distance + LD:MAF_taggedSNP
.
In our experiments, the interaction model provides slightly better performance, but it is not significant. We still recommend users to use the linear model for better intepretbility.
- User friendly interface.
- We created 2 scripts
model_pipeline.sh
, andLmTag_pipeline.sh
that provide more friendly interface for users. - We also provide a docker image that can run the LmTag easily.
- Details instructions and tutorials for using the two wrapers and
docker
image is in the sub-directory:docker
.
- Add tagged marker positions column in output.
- Add --vip and --exclude parameters: use to add VIP SNPS list and excluded SNPS list for the program.
- VIP SNPs are prioritized to be selected by the algorithm while excluded SNPs are elimiated from the selection.
NOTE: not sure that all VIP SNPs are selected and all excluded SNPs are excluded because the program weight imputation is the primary factor of choosing SNPs.
- First submission
LmTag
is a model based method to find tagSNP in SNP array desgin that maximizes imputation coverage and functional score of tag SNPs. Full details of the method is described in the method manuscript.
LmTag
is implemented in R
and C++
. LmTag
requires vcftools, bcftools, minimac3, minimac4, plink
for model construction step.
VCFtools 0.1.17
: https://vcftools.github.iobcftools 1.10.2
: https://github.com/samtools/bcftoolsplink1.9
: https://www.cog-genomics.org/plink/minimac3
: https://genome.sph.umich.edu/wiki/Minimac3minimac4
: https://genome.sph.umich.edu/wiki/Minimac4- A
C++-11
compliant compiler version ofGCC (g++ >= 4.8.2)
R
packages version3.6.0
or latter withdata.table
package installed.
# download the software:
git clone https://github.com/datngu/LmTag.git
## Move to the *LmTag* directory and do configuration for LmTag
cd LmTag
# build the C++ program
make
# export to PATH
export PATH=$PWD/bin:$PATH
cd ..
The installation assumes that vcftools, bcftools, minimac3, minimac4, plink
are available and can be call directly from your terminal promt.
If vcftools, bcftools, minimac3, minimac4, plink
are not available, please install and add these tools to PATH
variable.
OTHERWISE, THERE WILL BE ERRORS
- Only biallelic SNPS are considered.
MAF
are carefully check before runing (recommend to usefilltag
command bybcftools
before input to LmTag pipeline) as LmTag extracts MAF directly fromINFO/AF
information in vcf file.- Minimum
MAF
threshold are pre-determine in vcf filer, so you need to do filtering before providing the file toLmTag
pipeline. #CHROM
should be encoded without 'chr' character.
- Functional scores of SNPs candidates, typically extracted from the
CADD
databases [optional] - List of VIP SNPs - they will be prioritized in tag SNP selection at highest levels; typically SNPs in
GWAS catalog
orClinVar
databases that you really want to select as tag SNPs [optional] - List of bad SNPs - they will be tried to exluded in tag SNP selection; typically SNPs in repeated regions or low quality that you don't want selected as tag SNPs [optional]
We provide in this tutorial based on chromosome 10, East Asian population dataset:
- vcf file: https://zenodo.org/record/5807198/files/chr10_EAS.vcf.gz?download=1
- CADD score file: https://zenodo.org/record/5807198/files/chr10_EAS_CADD.txt?download=1
- VIP SNP list: https://zenodo.org/record/5807198/files/VIP_GWAS_CLINVAR_ALL.txt?download=1
Assumed that you have a raw vcf file: chr10_EAS.vcf.gz
in your current directory, recommended protocol to prepare vcf file is:
bcftools view chr10_EAS.vcf.gz -m2 -M2 -v snps -Q 0.9999999999:major -q 0.01:minor -e 'ALT="."' | bcftools +fill-tags | sed 's/chr//g' | bgzip > chr10_EAS_processed.vcf.gz
Now we compute LD
with plink v1.9
and extract MAF
with bcftools
:
# compute LD
mkdir plink
plink --vcf chr10_EAS_processed.vcf.gz \
--vcf-half-call 'haploid' \
--make-bed --const-fid --out ./plink/chr10_EAS_tem_file \
--threads 1 \
--memory 2000
plink --bfile ./plink/chr10_EAS_tem_file \
--r --ld-window-r2 0.2 \
--ld-window 10000 \
--ld-window-kb 1000 \
--out ./plink/chr10_EAS_ld_0.2 \
--threads 8 \
--memory 2000
mv ./plink/chr10_EAS_ld_0.2.ld ./
rm -r plink
# extract MAF
echo $'CHR\tPOS\tAF' > chr10_EAS_extracted_AF.txt
bcftools query -f '%CHROM\t%POS\t%AF\n' chr10_EAS_processed.vcf.gz >> chr10_EAS_extracted_AF.txt
We need to build a reference directory for leave one out cross validation imputation with create_imputation_ref.sh
. This step may take very long time because it will generate n reference m3vcf files with n-1 samples. n is number of sample in your vcf.gz
file. You may download pre-built reference instead of generate it yourself.
create_imputation_ref.sh -v chr10_EAS_processed.vcf.gz -o chr10_EAS_hg38_high_cov -p 16
NOTE: Pre-built imputation reference panel of populations are available for downloading:
EAS: https://zenodo.org/record/5807198/files/chr10_EAS.tar.gz?download=1
EUR: https://zenodo.org/record/5807198/files/chr10_EUR.tar.gz?download=1
SAS: https://zenodo.org/record/5807198/files/chr10_SAS.tar.gz?download=1
Assumed that you have downloaded chr10_EAS.tar.gz
, the upzip command is:
tar -xvzf chr10_EAS.tar.gz
We need to build a naive array to entablish relation between LD, MAF, and distance of SNPs with build_naive_array.R
. The idea is to sampling n SNPs uniformlly based on their index after sorting by genomic position. Next, we impute with pre-built imputation reference m3vcf with imputation_with_prebuilt_ref.sh
and compute imputaion accuracy with compute_imputation_accuracy.R
.
# size=32970 is the size (number of tag SNP selected by obtain by TagIt with the same input vcf file - read main paper for further information).
build_naive_array.R vcf=chr10_EAS_processed.vcf.gz size=32970 out=chr10_EAS_naive.txt
imputation_with_prebuilt_ref.sh -t chr10_EAS_naive.txt -r chr10_EAS_hg38_high_cov -o naive_chr10_EAS -p 16
compute_imputation_accuracy.R imputation=naive_chr10_EAS out=naive_chr10_EAS.Rdata
Finding best tagSNP (belong to chr10_EAS_naive.txt) by LmTag find
command for all SNPs.
LmTag find --tag chr10_EAS_naive.txt --ld chr10_EAS_ld_0.2.ld -o chr10_EAS_find_snp_output.txt
Now we can build model with buid_imputation_model.R
buid_imputation_model.R imputation_Rdata=naive_chr10_EAS.Rdata find_snp=chr10_EAS_find_snp_output.txt out_Rdata=chr10_EAS_model.Rdata
This step need ld file generated by plink v1.9
, AF file generated by bcftools
, and model build by ./buid_imputation_model.R
.
Now LmTag supports 2 models:
- The linear option (
model=linear
) models imputation accuracy of a linear funcion of:Imputation_accuracy ~ LD + MAF_tagSNP + MAF_taggedSNP + distance
. This model was used in previous versions.
fit_imputation_model.R model_Rdata=chr10_EAS_model.Rdata model=linear af=chr10_EAS_extracted_AF.txt ld=chr10_EAS_ld_0.2.ld ld_cutoff=0.8 out_ld=chr10_EAS_ld_fitted_model.txt
- The interaction option (
model=interaction
) models imputation accuracy of a linear funcion with interaction term of LD and MAF_taggedSNP:Imputation_accuracy ~ LD + MAF_tagSNP + MAF_taggedSNP + distance + LD:MAF_taggedSNP
.
fit_imputation_model.R model=interaction model_Rdata=chr10_EAS_model.Rdata af=chr10_EAS_extracted_AF.txt ld=chr10_EAS_ld_0.2.ld ld_cutoff=0.8 out_ld=chr10_EAS_ld_fitted_model.txt
In our experiments, the interaction model provides slightly better performance, but it is not significant. We still recommend users to use the linear model for better intepretbility.
## testing with k = 200
LmTag tag --ld_model chr10_EAS_ld_fitted_model.txt \
--eff chr10_EAS_CADD.txt \
--vip VIP_GWAS_CLINVAR_ALL.txt \
-k 200 \
-o chr10_EAS_tagSNP.txt
Input argument:
Name | Description |
---|---|
--ld_model | fitted model ld file, generated by it_imputation_model.R |
--eff | file provide infomation of effect score of input SNP |
--exclude | list of SNPs that will be try to avoid to select by the algorithm |
--vip | list of SNPs that will be try to prioritize to select by the algorithm |
-k | k value of the beam search algorithm |
-o | output file name |
LmTag output file is chr10_EAS_tagSNP.txt with following infomation
Name | Description |
---|---|
chr | chromosome of tag SNP |
pos | position of tag SNP |
id | typically isrsID of tag SNP - but depends on input ld file |
sum_score | sum of impuation score - used for weighting in tag SNP selection |
degree | degree of tag SNP in the graph |
effect_score | effect score of tag SNP |
flag | source of tag SNP - it can be normal, vip (from vip list), excluded (from excluded list) |
tagged_pos | positions of tagged SNPs |
cat chr10_EAS_tagSNP.txt | grep -v "chr" | awk '//{printf "%s\t%s\n", $1, $2}' > chr10_EAS_tagSNP_cleaned.txt
imputation_with_prebuilt_ref.sh -t chr10_EAS_tagSNP_cleaned.txt -r chr10_EAS_hg38_high_cov -o LmTag_chr10_EAS -p 16
compute_imputation_accuracy.R imputation=LmTag_chr10_EAS out=LmTag_chr10_EAS.Rdata
The Software is restricted to non-commercial research purposes.
Dat Thanh Nguyen, Quan Hoang Nguyen, Nguyen Thuy Duong, Nam S Vo, LmTag: functional-enrichment and imputation-aware tag SNP selection for population-specific genotyping arrays, Briefings in Bioinformatics, 2022;, bbac252, https://doi.org/10.1093/bib/bbac252