N characters introduced into *indels.csv #162
Comments
Hello @alan-tracey, thanks for reporting this.
Hi Júlia
In my case I don't think that explains it, since I am not using paired-end
sequencing. I am using just R1 (R2 is only used to capture a barcode
sequence and is then discarded).
Thanks,
Alan
On Fri, 19 Jul 2024 at 17:01, Júlia Mir Pedrol wrote:
Hello @alan-tracey, thanks for reporting this.
I had a look at the hCas9-TRAC sample from the test data, and in this case
the masked bases are due to bad quality. Even if the original reads have
good quality, we use pear
(https://cme.h-its.org/exelixis/web/software/pear/doc.html) to join the R1
and R2 reads. This computes a new quality score based on the overlapping
bases; if a base is not the same in R1 and R2, the new quality will be
lower.
You can check the assembled fastq files in the output directory
preprocessing/pear to see whether the same thing happens with
your samples.
--
Alan Tracey
Bioinformatician
T +44 (0)1223 787297
***@***.*** ***@***.***>
The Dorothy Hodgkin Building
Babraham Research Campus
Cambridge CB22 3FH
United Kingdom
***@***.*** ***@***.***> | www.bit.bio
Follow us
<https://twitter.com/bitbio>
<https://www.linkedin.com/company/bitbioltd/>
[image: bit.bio] <http://www.bit.bio/>
Notice: This message is the property of Bit Bio Ltd and contains
information that may be confidential and/or privileged. If you are not the
intended recipient, you should not use, disclose or take any action based
on this message. If you have received this transmission in error, please
immediately contact the sender by return e-mail and delete this e-mail, and
any attachments, from any computer.
Could you check if the Ns are actually added by seqtk?
Hi Júlia
I've checked and there are reads in preprocessing/seqtk that contain 'N's.
I have run the pipeline with default settings and --overrepresented.
Thanks,
Alan
On Fri, 19 Jul 2024 at 17:12, Júlia Mir Pedrol wrote:
Could you check if the Ns are actually added by seqtk? You can find the
output files after this tool in preprocessing/seqtk. By default we are
using the parameters -q 20 -L 80 -n N for seqtk, which should mask bases
with a quality lower than 20. Are you modifying these parameters, or
running the pipeline with all the defaults?
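To make the effect of the default seqtk settings quoted above concrete, here is a minimal Python sketch of quality masking. It is not seqtk's code: the function names `mask_low_quality` and `keep_read` are hypothetical, and Phred+33 quality encoding is assumed. It only mirrors the described behaviour, where bases with quality lower than 20 become N and reads shorter than 80 bases are dropped.

```python
def mask_low_quality(seq: str, qual: str, min_q: int = 20,
                     mask: str = "N", offset: int = 33) -> str:
    """Replace every base whose Phred score is below min_q with `mask`.

    Assumes Phred+33 encoding (offset=33); this is an illustration of the
    masking idea, not seqtk's actual implementation.
    """
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings differ in length")
    return "".join(
        base if ord(q) - offset >= min_q else mask
        for base, q in zip(seq, qual)
    )


def keep_read(seq: str, min_len: int = 80) -> bool:
    """Mimic a minimum-length filter: keep reads of at least min_len bases."""
    return len(seq) >= min_len


if __name__ == "__main__":
    # 'I' is Q40, '5' is Q20, '#' is Q2 under Phred+33.
    print(mask_low_quality("ACGT", "I5#I"))
```

Under these assumptions a base is masked only when its own quality character decodes to less than 20, so a read whose input qualities are all above 30 should pass through unchanged unless an earlier step rewrote its quality string.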
Is the quality of those Ns higher than 20?
If you are using --overrepresented, the input reads to seqtk are under <outdir>/preprocessing/cutadapt.
It looks like there are N bases with quality <20 (here comparing seqtk vs
cutadapt as you suggested):
zgrep -A4 "M07996:142:000000000-LKRPM:1:1101:15613:1876" S04B2_CIITA.seqtk-seq.fastq.gz
@M07996:142:000000000-LKRPM:1:1101:15613:1876 1:N:0:GCCTTCGGGA+CCCACGATTT
GGTGACTGAGCATTGTCTTCCCTCCCAGGCAGCTCACAGTGTGCCACCNNGGANTTGGGGCCCCTAGAAGGTGGCTTACCTGGAGCTTCTTAACAGCGATGCTGACCCCGTGTGCCTCTACCACTTCTATNACCNNNTGGN
+
***@***.***
<1==G1..<GH/
@M07996:142:000000000-LKRPM:1:1101:17082:1937 1:N:0:GCCTTCGGTA+CCAACGATTT
zgrep -A4 "M07996:142:000000000-LKRPM:1:1101:15613:1876" S04B2_CIITA.trim.fastq.gz
@M07996:142:000000000-LKRPM:1:1101:15613:1876 1:N:0:GCCTTCGGGA+CCCACGATTT
GGTGACTGAGCATTGTCTTCCCTCCCAGGCAGCTCACAGTGTGCCACCATGGAGTTGGGGCCCCTAGAAGGTGGCTTACCTGGAGCTTCTTAACAGCGATGCTGACCCCGTGTGCCTCTACCACTTCTATGACCAGATGGA
+
***@***.***
<1==G1..<GH/
@M07996:142:000000000-LKRPM:1:1101:14827:1933 1:N:0:GCCTTCGGTA+CCAACGATTT
On Fri, 19 Jul 2024 at 17:26, Júlia Mir Pedrol wrote:
If you are using --overrepresented, the input reads to seqtk are under
<outdir>/preprocessing/cutadapt. Could you double-check whether the reads
that contain Ns after seqtk also contain these Ns after cutadapt, and
whether they are absent from the input raw fastq files?
Thanks for helping with this debugging :)
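The read-by-read check requested above can also be done for a whole file rather than one read at a time with zgrep. The following is a rough Python sketch, assuming gzipped four-line FASTQ records and Phred+33 qualities. The helper names `read_fastq` and `introduced_ns` are hypothetical and not part of the pipeline. For each read, it lists the positions that are N in the later file but not in the earlier one, together with the original base and its quality.

```python
import gzip
from typing import Dict, Iterator, List, Tuple


def read_fastq(path: str) -> Iterator[Tuple[str, str, str]]:
    """Yield (read_id, sequence, quality) from a gzipped 4-line FASTQ file."""
    with gzip.open(path, "rt") as handle:
        while True:
            header = handle.readline().rstrip("\n")
            if not header:
                break
            seq = handle.readline().rstrip("\n")
            handle.readline()  # '+' separator line
            qual = handle.readline().rstrip("\n")
            # Read id is the first whitespace-delimited token, without '@'.
            yield header[1:].split()[0], seq, qual


def introduced_ns(before_path: str, after_path: str,
                  offset: int = 33) -> Dict[str, List[Tuple[int, str, int]]]:
    """Map read id -> [(position, original base, original Phred score)]
    for every N present in the 'after' file but not in the 'before' file.

    Holds the whole 'before' file in memory, so it suits modest inputs.
    """
    before = {rid: (seq, qual) for rid, seq, qual in read_fastq(before_path)}
    report: Dict[str, List[Tuple[int, str, int]]] = {}
    for rid, seq, _ in read_fastq(after_path):
        if rid not in before:
            continue
        old_seq, old_qual = before[rid]
        if len(old_seq) != len(seq):
            continue  # lengths differ, so positions are not comparable
        hits = [
            (i, old_seq[i], ord(old_qual[i]) - offset)
            for i, base in enumerate(seq)
            if base == "N" and old_seq[i] != "N"
        ]
        if hits:
            report[rid] = hits
    return report
```

For example, `introduced_ns("S04B2_CIITA.trim.fastq.gz", "S04B2_CIITA.seqtk-seq.fastq.gz")` would compare the two files named in the zgrep commands above. If the reported scores are all below 20, the masking is consistent with the `-q 20` setting.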
Description of the bug
Hi, I’ve just run crisprseq using the targeted pipeline with a read1.fastq.gz only. I heavily quality-filtered the input reads, removing any reads containing N characters. In the output indels.csv, there are many cases of N characters being reported in the "pre_ins_nt", "ins_nt" and "post_ins_nt" columns. When I check these reads in the input fastq file, the bases reported as N are [ACGT] characters with Q value > 30. For the handful of reads I’ve looked at with these reported N characters, most of the called insertion (normal ACGT sequence) can be found in the input sequence, which further suggests that these N calls could be erroneous. My data is confidential, so unfortunately I cannot share it. However, I notice that in the test dataset output there are Ns reported in some of the insertion outcomes that don't occur in the input reads, e.g. M00724:1:000000000-DC7GJ:1:1102:19229:3583 in hCas9-TRAC-a_R*.fastq.gz, which has AGA-N-CAT.
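One way to list the affected rows is sketched below in Python. The three column names come from the description above. Everything else is an assumption: a comma-delimited file with a header row, "NA" used for missing values, and the hypothetical function name `rows_with_n`.

```python
import csv
from typing import Dict, Iterable, List, Sequence

# Column names as given in the bug description.
INSERTION_COLUMNS = ("pre_ins_nt", "ins_nt", "post_ins_nt")


def rows_with_n(lines: Iterable[str],
                columns: Sequence[str] = INSERTION_COLUMNS) -> List[Dict[str, str]]:
    """Return the CSV rows in which any of `columns` contains an N base.

    Empty cells and the literal string "NA" (assumed here to mean a
    missing value) are skipped so they are not mistaken for N bases.
    """
    hits = []
    for row in csv.DictReader(lines):
        for col in columns:
            value = (row.get(col) or "").strip()
            if value in ("", "NA"):
                continue
            if "N" in value.upper():
                hits.append(row)
                break
    return hits
```

It could be called as `with open("indels.csv", newline="") as fh: hits = rows_with_n(fh)`. The read identifiers in the matching rows can then be looked up in the preprocessing outputs.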
Command used and terminal output
No response
Relevant files
No response
System information
No response