Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How to build a manifest.tsv file. Can you provide a sample file? #116

Open
joybio opened this issue Aug 8, 2022 · 18 comments
Open

How to build a manifest.tsv file. Can you provide a sample file? #116

joybio opened this issue Aug 8, 2022 · 18 comments

Comments

@joybio
Copy link

joybio commented Aug 8, 2022

comma or tab separated? how about pairwise sequencing?
can I use compressed files, like gz or tar.gz?

here is an example of my manifest.tsv, is this ok?

#########################################################################
name vcf reads reads
201550 NGS_Mapping/SNP/gvcf/201550.gvcf.gz 01.Cleandata/201550/201550.1.fq.clean.gz 01.Cleandata/201550/201550.2.fq.clean.gz
183207 NGS_Mapping/SNP/gvcf/183207.gvcf.gz 01.Cleandata/183207/183207.1.fq.clean.gz 01.Cleandata/183207/183207.2.fq.clean.gz
#########################################################################

thanks.

@martinghunt
Copy link
Member

It's a tab-delimited file, as described here:
https://github.com/iqbal-lab-org/minos/wiki/Minos-for-joint-genotyping

@joybio
Copy link
Author

joybio commented Aug 8, 2022

thanks for the links!
but I got another error as followed:

Command error:
Traceback (most recent call last):
File "/share/data3//miniconda3/envs/minos/bin/minos", line 8, in
sys.exit(main()) File "/share/data3/
/miniconda3/envs/minos/lib/python3.9/site-packages/minos/main.py", line 300, in main
args.func(args)
File "/share/data3//miniconda3/envs/minos/lib/python3.9/site-packages/minos/tasks/vcf_cluster.py", line 6, in run
tracker.cluster(
File "/share/data3/
/miniconda3/envs/minos/lib/python3.9/site-packages/cluster_vcf_records/variant_tracking.py", line 672, in cluster
split_ranges = self.find_parallel_cluster_split_points(max_ref_length, cpus)
File "/share/data3/*/miniconda3/envs/minos/lib/python3.9/site-packages/cluster_vcf_records/variant_tracking.py", line 667, in find_parallel_cluster_split_points
split_ranges.append([split_ranges[-1][-1] + 1, len(self.variants) - 1])
IndexError: list index out of range
##############################################
my manifest.tsv:

name reads vcf
201550 /home/NGS/201550.dup.bam /home/NGS/201550.gvcf.gz
...
###############################################

@martinghunt
Copy link
Member

Is it a lot of samples? If possible could you share the VCF files with me please and I can debug

@joybio
Copy link
Author

joybio commented Aug 8, 2022

it's about 1000 vcf files. the following text is the head of vcf files (all like this).
#################################################################
#description
...
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 180233
NZ_CP042314.1 1 . C <NON_REF> . . END=87 GT:DP:GQ:MIN_DP:PL 0:202:99:58:0,1800
NZ_CP042314.1 88 . T C,<NON_REF> 13506.04 . DP=356;MLEAC=1,0;MLEAF=1.00,0.00;RAW_MQandDP=1281600,356 GT:AD:DP:GQ:PL:SB
NZ_CP042314.1 89 . C <NON_REF> . . END=528 GT:DP:GQ:MIN_DP:PL 0:602:99:345:0,1800
NZ_CP042314.1 529 . A C,<NON_REF> 28517.04 . DP=647;MLEAC=1,0;MLEAF=1.00,0.00;RAW_MQandDP=2329200,647 GT:AD:DP:GQ:PL:SB
NZ_CP042314.1 530 . T <NON_REF> . . END=543 GT:DP:GQ:MIN_DP:PL 0:630:99:624:0,1800
NZ_CP042314.1 544 . G A,<NON_REF> 28195.04 . DP=645;MLEAC=1,0;MLEAF=1.00,0.00;RAW_MQandDP=2322000,645 GT:AD:DP:GQ:PL:SB
NZ_CP042314.1 545 . C <NON_REF> . . END=650 GT:DP:GQ:MIN_DP:PL 0:577:99:529:0,1800
NZ_CP042314.1 651 . A C,<NON_REF> 21344.04 . DP=541;MLEAC=1,0;MLEAF=1.00,0.00;RAW_MQandDP=1947600,541 GT:AD:DP:GQ:PL:SB
NZ_CP042314.1 652 . G <NON_REF> . . END=1038 GT:DP:GQ:MIN_DP:PL 0:597:99:494:0,1800
NZ_CP042314.1 1039 . A G,<NON_REF> 19547.04 . DP=512;MLEAC=1,0;MLEAF=1.00,0.00;RAW_MQandDP=1843200,512 GT:AD:DP:GQ:PL:SB
NZ_CP042314.1 1040 . C <NON_REF> . . END=1369 GT:DP:GQ:MIN_DP:PL 0:562:99:439:0,1800
NZ_CP042314.1 1370 . C A,<NON_REF> 22301.04 . DP=580;MLEAC=1,0;MLEAF=1.00,0.00;RAW_MQandDP=2088000,580 GT:AD:DP:GQ:PL:SB
NZ_CP042314.1 1371 . G <NON_REF> . . END=2005 GT:DP:GQ:MIN_DP:PL 0:542:99:503:0,1800
...
215663 lines
....

i'm reading this function (find_parallel_cluster_split_points) , what information are we processing in this function.
Does the str "<NON_REF>" affect?

thanks!

@martinghunt
Copy link
Member

I've just tested and having <NON_REF> for an allele will break the pipeline later on (the gramtools_build stage). I expect bad things will also be happening internally before the actual failure.

There's also two nextflow parameters that may be causing your VCFs to be excluded. Their defaults made sense for TB, but maybe not for your data.

  1. --max_variants_per_sample. Default is 5000. If a VCF has more records than this, then it is excluded (on the basis that it must be too far away from the reference genome)
  2. --max_genome_proportion_per_sample. Default is 0.05. This is saying (approximately) that if the sample is more than 5% away from the reference, then it is excluded. The actual test is: (total length of REF alleles) / (genome size) > 0.05.

Sorry, these should be documented - is on my to do list.

@joybio joybio closed this as completed Aug 8, 2022
@joybio
Copy link
Author

joybio commented Aug 8, 2022

thank you very much!

what should i do with "<NON_REF>", can you add a if "sentence" to process this?

@martinghunt
Copy link
Member

I'll add in a step that removes all non-ACGT alleles. You could wait for that fix, or remove the <NON_REF> alleles from your VCF files.

@joybio
Copy link
Author

joybio commented Aug 8, 2022

can i reset these two params? or just add these two params in the nextflow command lines?

I've just tested and having <NON_REF> for an allele will break the pipeline later on (the gramtools_build stage). I expect bad things will also be happening internally before the actual failure.

There's also two nextflow parameters that may be causing your VCFs to be excluded. Their defaults made sense for TB, but maybe not for your data.

  1. --max_variants_per_sample. Default is 5000. If a VCF has more records than this, then it is excluded (on the basis that it must be too far away from the reference genome)
  2. --max_genome_proportion_per_sample. Default is 0.05. This is saying (approximately) that if the sample is more than 5% away from the reference, then it is excluded. The actual test is: (total length of REF alleles) / (genome size) > 0.05.

Sorry, these should be documented - is on my to do list.

@joybio
Copy link
Author

joybio commented Aug 8, 2022

thanks, i'll wait for the update.

I'll add in a step that removes all non-ACGT alleles. You could wait for that fix, or remove the <NON_REF> alleles from your VCF files.

@martinghunt
Copy link
Member

can i reset these two params? or just add these two params in the nextflow command lines?

Add them to the nextflow run command, like nextflow run --max_variants_per_sample 1000000 ...etc.

@joybio
Copy link
Author

joybio commented Aug 8, 2022

ok.

can i reset these two params? or just add these two params in the nextflow command lines?

Add them to the nextflow run command, like nextflow run --max_variants_per_sample 1000000 ...etc.

@joybio
Copy link
Author

joybio commented Aug 8, 2022

hi. i used 1 vcf file as input, replace("<NON_REF>",".") in the vcf and add "--max_variants_per_sample 100000000 --max_genome_proportion_per_sample 1" to the nextflow run command, got the same error too.

@martinghunt
Copy link
Member

can you share the vcf file?

@joybio
Copy link
Author

joybio commented Aug 9, 2022

#description
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 201550
NZ_CP042314.1 1 . C . . . END=392 GT:DP:GQ:MIN_DP:PL 0:0:0:0:0,0
NZ_CP042314.1 393 . T . . . END=399 GT:DP:GQ:MIN_DP:PL 0:4:99:4:0,135
NZ_CP042314.1 400 . C . . . END=403 GT:DP:GQ:MIN_DP:PL 0:4:90:4:0,90
NZ_CP042314.1 404 . G . . . END=404 GT:DP:GQ:MIN_DP:PL 0:4:45:4:0,45
NZ_CP042314.1 405 . A . . . END=1199 GT:DP:GQ:MIN_DP:PL 0:0:0:0:0,0
NZ_CP042314.1 1200 . A . . . END=1209 GT:DP:GQ:MIN_DP:PL 0:1:42:1:0,42
NZ_CP042314.1 1210 . C . . . END=1228 GT:DP:GQ:MIN_DP:PL 0:7:45:7:0,45
NZ_CP042314.1 1229 . G . . . END=1231 GT:DP:GQ:MIN_DP:PL 0:7:0:7:0,0
NZ_CP042314.1 1232 . G . . . END=1233 GT:DP:GQ:MIN_DP:PL 0:7:45:7:0,45
NZ_CP042314.1 1234 . G . . . END=1237 GT:DP:GQ:MIN_DP:PL 0:7:0:7:0,0
NZ_CP042314.1 1238 . G . . . END=1239 GT:DP:GQ:MIN_DP:PL 0:7:45:7:0,45
NZ_CP042314.1 1240 . C . . . END=1243 GT:DP:GQ:MIN_DP:PL 0:7:0:7:0,0
NZ_CP042314.1 1244 . G . . . END=1244 GT:DP:GQ:MIN_DP:PL 0:7:45:7:0,45
NZ_CP042314.1 1245 . G . . . END=1245 GT:DP:GQ:MIN_DP:PL 0:7:0:7:0,0
NZ_CP042314.1 1246 . G . . . END=1248 GT:DP:GQ:MIN_DP:PL 0:7:45:7:0,45
NZ_CP042314.1 1249 . T . . . END=1250 GT:DP:GQ:MIN_DP:PL 0:7:0:7:0,0
NZ_CP042314.1 1251 . A . . . END=1253 GT:DP:GQ:MIN_DP:PL 0:7:45:7:0,45
NZ_CP042314.1 1254 . C . . . END=1255 GT:DP:GQ:MIN_DP:PL 0:7:0:7:0,0
NZ_CP042314.1 1256 . C . . . END=1256 GT:DP:GQ:MIN_DP:PL 0:7:45:7:0,45
NZ_CP042314.1 1257 . A . . . END=1257 GT:DP:GQ:MIN_DP:PL 0:7:0:7:0,0
NZ_CP042314.1 7892 . C G,. 0.01 . BaseQRankSum=0.000;DP=2;MLEAC=0,0;MLEAF=0.00,0.00;MQRankSum=-0.674;RAW_MQandDP=5200,2;ReadPosRankSu
m=0.674 GT:AD:DP:GQ:PL:SB 0:1,1,0:2:0:0,0,45:1,0,0,1
NZ_CP042314.1 7893 . C . . . END=7893 GT:DP:GQ:MIN_DP:PL 0:2:72:2:0,72
NZ_CP042314.1 7894 . G GT,. 0 . BaseQRankSum=0.000;DP=2;MLEAC=0,0;MLEAF=0.00,0.00;MQRankSum=-0.674;RAW_MQandDP=5200,2;ReadPosRankSu
m=0.674 GT:AD:DP:GQ:PL:SB 0:1,1,0:2:0:0,0,45:1,0,0,1

############################################

there is only 1 difference: the character "<NON_REF>" was transformed to "."

@martinghunt
Copy link
Member

I think the cause is that all of the records are getting ignored either because there's no GT field, or because the GT calls are reference, not alt. I've added a description of which alleles in the input get used here:
https://github.com/iqbal-lab-org/minos/wiki/Minos-in-single-sample-mode#input-vcf-files

I've also made a toy data set so you can test the installation. Instructions are here:
https://github.com/iqbal-lab-org/minos/wiki/Minos-test-data

You'll need to update to the new release 0.12.4 to get the toy data, and that version will also handle the <xyz> style alleles.

@joybio
Copy link
Author

joybio commented Aug 12, 2022

error messages (test data):
1, gramtools build: error: the following arguments are required: -o/--gram_dir

i have figured it out.
line 113 in nextflow/regenotype.nf.
--gram-dir ==> --gram_dir
--kmer-size ==> --kmer_size
#############################################################
2, [4f/75d978] process > parse_manifest [100%] 1 of 1 ✔
[c2/b9a58e] process > make_vcf_for_gramtools:vcf_merge [100%] 1 of 1 ✔
[41/e27f38] process > make_vcf_for_gramtools:vcf_cluster [100%] 1 of 1 ✔
[0d/109185] process > gramtools_build [100%] 1 of 1 ✔
[1d/1ce400] process > minos (1) [100%] 2 of 2, failed: 2 ✔
[- ] process > make_per_sample_vcfs_dir -
[- ] process > ivcf_merge_chunks -
[- ] process > ivcf_final_merge -
[f1/634237] NOTE: Process minos (2) terminated with an error exit status (1) -- Error is ignored
[1d/1ce400] NOTE: Process minos (1) terminated with an error exit status (1) -- Error is ignored
#log
[minos 12-08-2022 17:47:26 INFO] Depencency check: vcfbreakmulti Unknown vcfbreakmulti
[minos 12-08-2022 17:47:26 INFO] Depencency check: vcfallelicprimitives Unknown vcfallelicprimitives
[minos 12-08-2022 17:47:26 INFO] Depencency check: vcfuniq Unknown vcfuniq
[minos 12-08-2022 17:47:26 INFO] Depencency check: vt NOT_FOUND /share/data1/*/soft/miniconda3/envs/minos/bin/vt

I'm sure that vt is ok. what should I do?

@joybio
Copy link
Author

joybio commented Aug 12, 2022

another question:
what should I do when GT calls are reference, not alt.
########################################################
my vcf:
NZ_CP042314.1 1 . C . . . END=392 GT:DP:GQ:MIN_DP:PL 0:0:0:0:0,0
NZ_CP042314.1 393 . T . . . END=399 GT:DP:GQ:MIN_DP:PL 0:4:99:4:0,135
your test:
ref_name 100 . C A . . . GT 0/0
ref_name 101 . C . . . GT 1/1
ref_name 102 . T A,G . . . GT 0/1
ref_name 103 . G A,C, . . . GT 1/2

@martinghunt
Copy link
Member

The vt line in the log looks strange: I wouldn't expect a * in that path. What does which vt return for you?

For ref calls - you'll need to change the GT to be allele that you want considered for genotyping. But also in your VCF the ALT column has . - this will need to be changed to a nucleotide sequence.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants