Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WARNING: AD must list numbers #42

Open
ireneIta opened this issue Feb 15, 2021 · 6 comments
Open

WARNING: AD must list numbers #42

ireneIta opened this issue Feb 15, 2021 · 6 comments

Comments

@ireneIta
Copy link

Hi Chase,
thank you for this useful tool!
I am having some issues with a set of data. I got a warning, and SNPGenie terminates. I am running SNPGenie with --vcfformat=4; I got the vcf by merging individual vcfs with bcftools.
The warning is

WARNING: the line is:

NC_005967.2 4009 . T A,G 687.24 . BaseQRankSum=-0.486;Dels=0;FS=0;HRun=1;HaplotypeScore=1.6229;MQ=13.77;MQ0=7;MQRankSum=4.129;QD=7.29;ReadPosRankSum=-1.301;DP=1156;AF=1,0.5;AN=64;AC=38,21 GT:AD:DP:GQ:PL 1/1:36,25:61:9.09:477,9,0,.,.,.

For vcfformat==4, AD must list numbers of reads for two SNPs as follows: ,,. SNPGenie TERMINATED.

AD is actually a list of numbers (in the warning= 36,25), so I don't get the message... Could you please help me find out what's the problem?

I have another question: in the SNPGenie_parameters.txt file that is generated over the run, it is reported "COMPLEMENT MODE", which is "Yes" when running the software on "fw" data and is "No" when running it on revcom data. However, to my understanding (and according to the results of the few run), SNPGenie does not reverse-complement the data, so the results are only relative to one strand. Am I correct?

thank you very much!!
best wishes
Irene

@singing-scientist
Copy link
Contributor

singing-scientist commented Feb 15, 2021

Greetings @ireneIta and thanks for using SNPGenie! Sorry for the issue. Indeed the AD entry seems correct, so it is likely an issue with file encoding. I'd be happy to give it a try myself if you can send a reproducible example (reference / gtf / VCF file combo) — here or private email is fine.

It is correct that SNPGenie does not reverse complement the data, and one run of SNPGenie gives results for one strand only — although it is strand-AWARE in the sense that, if there are minus-strand entries in the GTF file, SNPGenie will correctly label those sites as genic and codons from two genes overlapping on opposite strands will still be flagged as such. It's a bit of a clunky fix, and an artifact of having been originally design for viral genomes, but the practical result is that SNPGenie must be run once for each strand with separate input for each strand.

Let me know if that helps!

@ireneIta
Copy link
Author

Dear Chase,
thank you very much for your quick reply
I have attached here one of the sets of data that is giving me the problem. Please let me know if there are issues with the attachment, I'll send them by email.
I ran the same analysis on a different set of data, using the same gtf and reference fasta but a different vcf, and it worked perfectly. I guess it could be an issue with the vcf but I cannot find the problem... I really appreciate your help!!!

thank you also for the clarification on the fw/revcom and for confirming that SNPGenie must be run separately for the two strands.

test_SNPGenie.zip

@singing-scientist
Copy link
Contributor

Ah, I see! This appears to be due to the merging of the VCF files — if you look at the "sample" columns at the right-most side of the output, the merging have forced each sample to have a record for each site that is variable in even one sample, so that the majority of records are "blank" in the form of "./.:.:.:.:.". Apparently I did not deal with this case for triallelic sites. If it it possible to run SNPGenie in a different mode on every individual VCF, that would circumvent the problem. Another quicker fix worth trying is to manually FIND/REPLACE, on this line only, as follows:

FIND: ./.:.:.:.:.
REPLACE: ./.:.,.:.:.:.

I can try to code for this case, but it will be several days before I can get around to it. Please let me know!

@ireneIta
Copy link
Author

thank you for the hints!
I have tried the FIND/REPLACE fix, but the warning still shows, unchanged.
I'll give it a go with the other option and let you know if it solves the issue.
I had a look at the original vcf of the problematic sample: in that sample the locus is monoallelic, whereas in other samples it is biallelic... and this may generate confusion in the merging with bcftools and following "decomposition" by SNPGenie (is this what you meant with triallelic sites)? maybe it's not related, but I am sure we can find it out with the approach you suggested.

I'll keep you posted.
thank you again for your great support!

@singing-scientist
Copy link
Contributor

Please forgive my dropping the ball on this! Did you find a solution that works? If you'd like to to give it a go, I did receive your files but please also provide the exact command-line arguments you ran with them. But perhaps it's already been solved!

Best,
Chase

@ireneIta
Copy link
Author

Dear Chase,
I am sorry for the delay, I am still working on it. I ran once the script and, after three days of analysis with no errors... I realized there was a typo in the name of the reference fasta file!
I am running it again. It will take some time because of the number of samples and chromosomes.
However, the results look ok, so far, so the issue seems to be solved.
Thank you for your support!!

best,
Irene

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants