
N characters introduced into *indels.csv #162

Open
alan-tracey opened this issue Jul 4, 2024 · 7 comments
Labels
bug Something isn't working

Comments

@alan-tracey
Contributor

Description of the bug

Hi, I’ve just run crisprseq using the targeted pipeline with a read1.fastq.gz only. I heavily quality-filtered the input reads, removing any reads containing N characters. Nevertheless, the output indels.csv reports many N characters in the "pre_ins_nt", "ins_nt" and "post_ins_nt" columns. When I check these reads in the input fastq file, the positions reported as N are actually [ACGT] bases with Q > 30. For the handful of reads I’ve inspected, the majority-called insertion (normal ACGT sequence) can be found in the input read, further suggesting these N calls are erroneous. My data is confidential, so I unfortunately cannot share it. However, I notice that the test dataset output also reports Ns in some insertion outcomes which don't occur in the input reads, e.g. M00724:1:000000000-DC7GJ:1:1102:19229:3583 in hCas9-TRAC-a_R*.fastq.gz has the insertion AGA-N-CAT.

Command used and terminal output

No response

Relevant files

No response

System information

No response

@alan-tracey alan-tracey added the bug Something isn't working label Jul 4, 2024
@mirpedrol
Member

Hello @alan-tracey, thanks for reporting this.
I had a look at the hCas9-TRAC sample from the test data, and in this case the masked bases are due to bad quality. Even if the original reads have good quality, we use pear to join the R1 and R2 reads, and pear computes a new quality score for the overlapping bases: if a base differs between R1 and R2, its new quality will be lower.
You can check the assembled fastq files in the output directory preprocessing/pear to confirm that the same thing happens with your samples.
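To illustrate the principle only (this is not PEAR's actual statistical model, just a toy sketch): when R1 and R2 agree on an overlap base the merged quality goes up, and when they disagree the higher-quality base is kept but its quality is penalised, which can push an originally Q30+ base below a downstream masking threshold:

```python
def merge_overlap_base(b1, q1, b2, q2):
    """Toy model of merging one overlap position from R1 and R2.

    NOT PEAR's real scoring scheme; it only demonstrates the effect:
    agreement reinforces the call, disagreement keeps the stronger
    base with a reduced quality. q1/q2 are Phred scores.
    """
    if b1 == b2:
        return b1, min(q1 + q2, 41)   # agreement: quality rises (capped)
    if q1 >= q2:
        return b1, q1 - q2            # disagreement: quality drops
    return b2, q2 - q1
```

So a position where R1 says A at Q35 and R2 says C at Q30 ends up well below Q20 in the assembled read, even though both inputs looked high quality.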

@alan-tracey
Contributor Author

alan-tracey commented Jul 19, 2024 via email

@mirpedrol
Member

Could you check whether the Ns are actually added by seqtk? You can find its output files in preprocessing/seqtk. By default we run seqtk with the parameters -q 20 -L 80 -n N, which should mask bases with a quality lower than 20. Are you modifying these parameters, or running the pipeline with all the defaults?
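For reference, a simplified per-read reimplementation of what those parameters do (a sketch, not seqtk itself; Phred+33 quality strings assumed):

```python
def mask_low_quality(seq, qual, q_threshold=20, min_len=80):
    """Roughly what `seqtk seq -q 20 -L 80 -n N` does to one read:
    drop it if shorter than min_len (-L 80), otherwise replace every
    base whose quality is below q_threshold (-q 20) with N (-n N).
    """
    if len(seq) < min_len:
        return None  # read dropped by the -L length filter
    return "".join("N" if ord(q) - 33 < q_threshold else b
                   for b, q in zip(seq, qual))
```

Any base that reaches seqtk with quality below 20, for example after the pear overlap recomputation, will come out as N.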

@alan-tracey
Contributor Author

alan-tracey commented Jul 19, 2024 via email

@mirpedrol
Member

Is the quality of those Ns higher than 20?

@mirpedrol
Member

If you are using --overrepresented, the input reads to seqtk are under <outdir>/preprocessing/cutadapt. Could you double-check whether the reads that contain Ns after seqtk also contain these Ns after cutadapt, but not in the raw input fastq files?
Thanks for helping with this debugging :)

@alan-tracey
Contributor Author

alan-tracey commented Jul 19, 2024 via email
