VCF2PCACluster

A simple and efficient tool for PCA and clustering analysis of population VCF files

The VCF2PCACluster article has been published in BMC Bioinformatics. Please cite it if you use this tool.

PMID: 38693489    DOI: 10.1186/s12859-024-05770-1

1 Introduction

VCF2PCACluster is an easy-to-use tool for PCA, clustering analysis and visualization, taking either a VCF file or a genotype file as input.

Highlights:

  1. Results match those generated by TASSEL, GAPIT and GCTA, differing only in numerical precision.
  2. Functions include: 1) five kinship estimation methods, 2) PCA analysis, 3) clustering, 4) visualization
  3. Easy to use: users only need to provide a VCF input
  4. Memory-efficient: memory usage is independent of the number of SNPs
  5. Three clustering methods: K-Means, EM Gaussian and DBSCAN
  6. Visualization in 2D or 3D plots

2 Download and Installation

New versions of VCF2PCACluster are released and maintained in the hewm2008/VCF2PCACluster repository. Please download the latest version from there.

2.1 Linux/Mac OS

Download

2.2 Pre-installation

VCF2PCACluster is for Linux/Unix/macOS only.

Before installing, please make sure the following prerequisites are available.

  1. OpenMP : a C/C++ compiler with OpenMP support is recommended
  2. g++ : g++ 4.8 or newer with --std=c++11 support is recommended
  3. zlib : a zlib version newer than 1.2.3 is recommended
  4. R : R with the ggplot2 and scatterplot3d packages is recommended
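
The prerequisites above can be checked quickly before building. This is only a minimal sketch: `check_cmd` is an illustrative helper name, not part of VCF2PCACluster, and it tests only whether each command is on `PATH`, not which version is installed. `perl` is included because the plotting scripts described later are run with it.

```shell
#!/bin/sh
# check_cmd NAME: print "ok NAME" if NAME is found on PATH, else "missing NAME".
# (Illustrative helper; it does not verify versions or R packages.)
check_cmd() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "ok $1"
    else
        echo "missing $1"
    fi
}

# Tools named in the pre-installation list, plus perl for the plot scripts.
for tool in g++ R perl; do
    check_cmd "$tool"
done
```

Any line starting with `missing` points to a tool that should be installed first if you plan to compile from source or draw the plots.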

2.3 Installation

Users can install it with the following options:

Option 1: use the provided static binary for Linux/Unix x64

git clone https://github.com/hewm2008/VCF2PCACluster.git
cd VCF2PCACluster;	chmod 755 -R bin/*
./bin/VCF2PCACluster  -h  ### print help information

Option 2: compile from source code for Linux/Unix/macOS

git clone https://github.com/hewm2008/VCF2PCACluster.git
cd VCF2PCACluster ; chmod 755 configure  ;  ./configure;
make;   # sh make.sh 
mv VCF2PCACluster  bin/;    #     [rm *.o]

Note: On macOS, run the following commands first. Please make sure g++-11 has been installed with Homebrew. We have tested this successfully on macOS Monterey with an Apple M1 chip.

  ln -s  /opt/homebrew/bin/g++-11     /opt/homebrew/bin/g++  ;
export  PATH=/opt/homebrew/bin/:$PATH

3 Parameters description

3.1 VCF2PCACluster

3.1.1 Main parameters

	Usage: VCF2PCACluster  -InVCF  in.vcf.gz  -OutPut outPrefix [options]

		-InVCF         <str>      Input SNP VCF Format
		-InGenotype    <str>      InPut Genotype File
		-InKinship     <str>      Input SNP K Kinship File Format
		-OutPut        <str>      OutPut File Prefix(Kinship PCA etc)


		-KinshipMethod <int>      Method of Kinship [1-5],defaut [1]
		                          1:Normalized_IBS[(Yang/BaldingNicolsKinship]
		                          2:Centered_IBS(VanRaden)
		                          3:IBSKinshipImpute 4:IBSKinship 5:p_dis
		-ClusterMethod <str>      Method For Cluster[EM/Kmean/DBSCAN/None] [EM]

		-help                     Show more Parameters and help [hewm2008]
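
For orientation, the two named estimators in the -KinshipMethod list have standard published forms, given below. These are the textbook definitions, not a transcription of the VCF2PCACluster source, so the tool's exact handling of missing genotypes, diagonal elements and scaling may differ. The notation is an assumption made here: x_{ij} is the 0/1/2 allele count of sample j at SNP i, p_i is the frequency of the counted allele at SNP i, and N is the number of SNPs.

```latex
% Method 1, Normalized_IBS (Yang / Balding-Nichols form), off-diagonal elements:
K_{jk} = \frac{1}{N} \sum_{i=1}^{N}
         \frac{(x_{ij} - 2p_i)(x_{ik} - 2p_i)}{2 p_i (1 - p_i)}

% Method 2, Centered_IBS (VanRaden):
K_{jk} = \frac{\sum_{i=1}^{N} (x_{ij} - 2p_i)(x_{ik} - 2p_i)}
              {\sum_{i=1}^{N} 2 p_i (1 - p_i)}
```

The two differ in where the normalization happens: method 1 scales each SNP by its own variance before averaging, while method 2 divides the summed cross-products by the summed variance.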

General usage:

    ### running without pop.info
    #   VCF2PCACluster	-InVCF	Khuman.vcf.gz	-OutPut	OUT
    ### running with  pop.info
        VCF2PCACluster	-InVCF	Khuman.vcf.gz	-OutPut	OUT	-InSampleGroup	pop.info 
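
The file passed to -InSampleGroup (pop.info above) is described in the help text as two columns, sample then group. Before a run it can be useful to check how many samples fall in each group. A minimal sketch, assuming whitespace-separated columns and no header line; the sample IDs are hypothetical and `count_groups` is an illustrative helper name, not part of VCF2PCACluster.

```shell
#!/bin/sh
# count_groups FILE: print "group count" for a two-column (sample, group) file,
# sorted by group name.
count_groups() {
    awk '{ n[$2]++ } END { for (g in n) print g, n[g] }' "$1" | sort
}

# Hypothetical sample IDs; the group labels are borrowed from Example 1.
group_file=$(mktemp)
printf 'S1\tCEU\nS2\tCEU\nS3\tYRI\n' > "$group_file"
count_groups "$group_file"
rm -f "$group_file"
```

A group with very few samples, or a misspelled label that shows up as its own group, is easier to spot here than in the final plot.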

3.1.2 Detailed parameters

	# for more Help document please see the manual.	Para [-i] is show for [-InVCF], Para [-o] is show for [-OutPut]

	Usage: VCF2PCACluster  -InVCF in.vcf.gz  -OutPut outPrefix [options]

		-InVCF         <str>      Input SNP VCF Format
		-InKinship     <str>      Input SNP K Kinship File Format
		-OutPut        <str>      OutPut File Prefix(Kinship PCA etc)

		-KinshipMethod <int>      Method of Kinship [1-5],defaut [1]
		                          1:Normalized_IBS(Yang/BaldingNicolsKinship)
		                          2:Centered_IBS(VanRaden)
		                          3:IBSKinshipImpute 4:IBSKinship 5:p_dis
		-ClusterMethod <str>      Method For Cluster[EM/Kmean/DBSCAN/None] [EM]

		-help          v1.40      Show more Parameters and help [hewm2008]

	    InFile:
		-InGenotype    <str>      InPut Genotype File for no VCF file
		-InSubSample   <str>      Only keep samples from subsample List for PCA[ALLsample]
		-InSampleGroup <str>      InFile of sample Group info,format(sample groupA)

	    SNP Filtering:
		-MAF           <float>    Min minor allele frequency filter [0.001]
		-Miss          <float>    Max ratio of miss allele filter [0.25]
		-Het           <float>    Max ratio of het allele filter [1.00]
		-HWE           <float>    Exact test of Hardy-Weinberg Equilibrium for SNP Pvalue[0]
		-Fchr          <str>      Filter the chrX chr[chrX,chrY,X,Y]
		-KeepRemainVCF            keep the VCF after filter

	    Clustering:
		-RandomCenter             Random diff-center to Re-Run Cluster for Kmean
		-BestKManually <int>      manually set the Best K (Num of Cluster) (auto)
		-BestKRatio    <float>    Get the best K Cluster by deta-SSE Ratio[0.15]
		-MinPointNum   <int>      Minimum point number of D-cluster[4]
		-Epsilon       <float>    Epsilon for DBSCAN_Distance/EM_convergence (auto)
		-Iterations    <int>      iterations number for EM clustering[1000]

	    OutPut:
		-PCnum         <int>      Num of PC eig [10]
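
To make the SNP filtering options concrete, the sketch below computes the three per-site statistics that -MAF, -Miss and -Het refer to and compares them with the default thresholds. It rests on assumptions that the help text does not confirm: sites are biallelic, the missing ratio is taken over all samples, the heterozygous ratio is taken over called genotypes, and the threshold comparisons are inclusive. `site_stats` and `passes_filters` are illustrative names, not part of VCF2PCACluster.

```shell
#!/bin/sh
# site_stats: read biallelic GT calls (0/0, 0|1, 1/1, ./.) for one site from
# stdin and print "MAF MISS HET" (assumed definitions, see the text above).
site_stats() {
    awk '{
        for (i = 1; i <= NF; i++) {
            total++
            split($i, a, "[/|]")
            if (a[1] == "." || a[2] == ".") { miss++; continue }
            called++
            alt += (a[1] == "1") + (a[2] == "1")   # count ALT alleles
            if (a[1] != a[2]) het++
        }
    }
    END {
        if (called == 0) { print "NA NA NA"; exit }
        p = alt / (2 * called)
        maf = (p < 0.5) ? p : 1 - p
        printf "%.3f %.3f %.3f\n", maf, miss / total, het / called
    }'
}

# passes_filters MAF MISS HET: succeed if the site clears the default
# thresholds from the help text (-MAF 0.001, -Miss 0.25, -Het 1.00).
passes_filters() {
    awk -v maf="$1" -v miss="$2" -v het="$3" \
        'BEGIN { exit !(maf >= 0.001 && miss <= 0.25 && het <= 1.00) }'
}

stats=$(echo "0/0 0/1 1/1 ./. 0|1" | site_stats)
echo "$stats"
# $stats is left unquoted on purpose so it splits into three arguments.
if passes_filters $stats; then echo keep; else echo drop; fi
```

For the five calls in the example, one is missing (ratio 0.2), two of the four called genotypes are heterozygous (0.5), and the ALT allele frequency is 4/8, so the site is kept under the defaults.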

3.2 Other parameters

VCF2PCACluster also provides two custom scripts, Plot2Deig and Plot3Deig, for 2D and 3D plots. Their main parameters are as follows:

perl    Plot2Deig/Plot3Deig -h

	Version:1.40         [email protected]

	Usage: Plot2Deig/Plot3Deig -InFile pca.eigenvec -OutPut Fig


		Options

		-InFile       <s> : InPut PCA.eigenvec File
		-OutPut       <s> : OutPut svg file result

		-help             : Show more help with more parameter

		-ColShap          : colour <=> shape for cluster or group
		-ShowEval         : Show eval%(PC percentages) on the fig
		-Columns      <s> : the columns to plot a:b [4:5]
		-ColorBrewer  <s> : the color brewer for points [Dark2]
		-Title        <s> : title (legend) [PCA]

		-BinDir       <s> : The Bin Dir of gnuplot/R/convert [$PATH]


			[email protected] / [email protected]
			   join the QQ Group : 125293663

3.3 Outputs

outFile               Description
out.kinship           Kinship matrix file
out.eigenvec          the best clustering and PCA result
out.eigenval          PCA eigen values
out.PC1_PC2.pdf       PCA and clustering 2D plot
out.PC1PC2PC3.pdf     PCA and clustering 3D plot

4 Examples

For more detailed usage, see the Chinese Documentation or the English Documentation.

../../bin/VCF2PCACluster      -InVCF        in.vcf.gz     -OutPut outPrefix

Two examples are provided in the Example/example* directories.

  • Example 1) a small test dataset

We randomly selected 1,194 SNPs on chromosome 22 (chr22) from the 1000 Genomes Project for analysis, covering 203 samples: CEU (49), CHB (46), JPT (56) and YRI (52).


PCA and EM Gaussian clustering plot using PC1 and PC2
PC12.png
PCA and EM Gaussian clustering plot using PC1 and PC3
PC13.png
PCA and EM Gaussian clustering plot using PC1,PC2, and PC3
PC3D.png

  • Example 2) a large test dataset

To test the accuracy and efficiency of VCF2PCACluster, we downloaded data from the 1000 Genomes Project and benchmarked it against other software on chr22, the smallest per-chromosome SNP set (2,504 samples, 1,055,401 SNPs). The result is the same as that generated by TASSEL and GCTA64; please see the manual for more details. Run time is about 12.5 min with 8 threads.
Memory usage is about 0.1 GB. We also tested a VCF covering all of chr1-22 (81,271,745 sites): the memory usage of VCF2PCACluster remained at 0.1 GB, whereas Plink2 exceeded 200 GB and returned an error.

echo Start Time :
date
##   download the real data  ###
#wget  -c https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr22.phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes.vcf.gz
#wget  -c https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/integrated_call_samples_v3.20130502.ALL.panel
# cut -f  1,3   integrated_call_samples_v3.20130502.ALL.panel  > sample.group ; gzip  sample.group
time	../../bin/VCF2PCACluster  -InVCF	ALL.chr22.phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes.vcf.gz	-InSampleGroup sample.group.gz 	-OutPut	OUT1
##  to re-set the best K (4--->5)
time	../../bin/VCF2PCACluster  -InKinship OUT1.Normalized_IBS.Kinship -InSampleGroup sample.group.gz 	-OutPut	OUT2   -BestKManually  5
echo End Time :
date

PCA and clustering result: the correlation coefficient between the prior group labels and the cluster assignments is 0.995, calculated using the cor function in R.

ALL_chr22.png

5 Advantages

  • Fast, with low memory usage
  • Simple and easy to use (-i -o)
  • Five kinship estimation methods
  • Three clustering methods
  • No installation required (a static binary is provided)
  • Only one step from VCF to the final plot
  • 2D or 3D plots of PCA and clustering results

6 Contact

If you have any questions, please contact the authors.

######################swimming in the sky and flying in the sea #############################