
N characters introduced into *indels.csv #162

Open
alan-tracey opened this issue Jul 4, 2024 · 7 comments
Labels
bug Something isn't working

Comments

@alan-tracey
Contributor

Description of the bug

Hi, I’ve just run crisprseq using the targeted pipeline with a read1.fastq.gz only. I heavily quality-filtered the input reads, removing any reads containing N characters. Nevertheless, the output indels.csv reports many N characters in the "pre_ins_nt", "ins_nt" and "post_ins_nt" columns. When I check these reads in the input fastq file, the positions reported as N are actually [ACGT] bases with Q > 30. For the handful of reads I’ve inspected, the majority-called insertion (normal ACGT sequence) can be found in the input read, further suggesting these N calls are erroneous. My data is confidential, so I unfortunately cannot share it. However, I notice that the test dataset output also reports Ns in some insertion outcomes which don't occur in the input reads, e.g. M00724:1:000000000-DC7GJ:1:1102:19229:3583 in hCas9-TRAC-a_R*.fastq.gz has the insertion AGA-N-CAT.

Command used and terminal output

No response

Relevant files

No response

System information

No response

@alan-tracey alan-tracey added the bug Something isn't working label Jul 4, 2024
@mirpedrol
Member

Hello @alan-tracey, thanks for reporting this.
I had a look at the hCas9-TRAC sample from the test data, and in this case the masked bases are due to bad quality. Even if the original reads have good quality, we use pear to join the R1 and R2 reads, and pear computes a new quality score for the overlapping bases: if a base differs between R1 and R2, its new quality will be lower.
You can check the assembled fastq files in the output directory preprocessing/pear to confirm that the same thing happens with your samples.
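To illustrate the principle only (this is not PEAR's actual statistical model, just a toy sketch): when R1 and R2 agree on an overlap base the merged quality goes up, and when they disagree the higher-quality base is kept but its quality is penalised, which can push an originally Q30+ base below a downstream masking threshold:

```python
def merge_overlap_base(b1, q1, b2, q2):
    """Toy model of merging one overlap position from R1 and R2.

    NOT PEAR's real scoring scheme; it only demonstrates the effect:
    agreement reinforces the call, disagreement keeps the stronger
    base with a reduced quality. q1/q2 are Phred scores.
    """
    if b1 == b2:
        return b1, min(q1 + q2, 41)   # agreement: quality rises (capped)
    if q1 >= q2:
        return b1, q1 - q2            # disagreement: quality drops
    return b2, q2 - q1
```

So a position where R1 says A at Q35 and R2 says C at Q30 ends up well below Q20 in the assembled read, even though both inputs looked high quality.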

@alan-tracey
Contributor Author

alan-tracey commented Jul 19, 2024 via email

@mirpedrol
Member

Could you check whether the Ns are actually added by seqtk? You can find its output files in preprocessing/seqtk. By default we run seqtk with the parameters -q 20 -L 80 -n N, which should mask bases with a quality lower than 20. Are you modifying these parameters, or running the pipeline with all the defaults?
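For reference, a simplified per-read reimplementation of what those parameters do (a sketch, not seqtk itself; Phred+33 quality strings assumed):

```python
def mask_low_quality(seq, qual, q_threshold=20, min_len=80):
    """Roughly what `seqtk seq -q 20 -L 80 -n N` does to one read:
    drop it if shorter than min_len (-L 80), otherwise replace every
    base whose quality is below q_threshold (-q 20) with N (-n N).
    """
    if len(seq) < min_len:
        return None  # read dropped by the -L length filter
    return "".join("N" if ord(q) - 33 < q_threshold else b
                   for b, q in zip(seq, qual))
```

Any base that reaches seqtk with quality below 20, for example after the pear overlap recomputation, will come out as N.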

@alan-tracey
Contributor Author

alan-tracey commented Jul 19, 2024 via email

@mirpedrol
Member

Is the quality of those Ns higher than 20?

@mirpedrol
Member

If you are using --overrepresented, the input reads to seqtk are under <outdir>/preprocessing/cutadapt. Could you double-check whether the reads that contain Ns after seqtk also contain these Ns after cutadapt, but not in the raw input fastq files?
Thanks for helping with this debugging :)

@alan-tracey
Contributor Author

alan-tracey commented Jul 19, 2024 via email
