Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Documentation update: Running the rnaseq pipeline on prokaryotic samples #1084

Open
MatthiasZepper opened this issue Oct 5, 2023 · 1 comment
Labels

Comments

@MatthiasZepper
Copy link
Member

Description of feature

The documentation of the rnaseq pipeline that refers to running the pipeline on prokaryotic samples is unfortunately completely outdated. It specifies the required settings for featureCounts, even though that tool has been superseded by salmon for transcript quantification about 10 pipeline releases ago!

On Slack, Marine Cambon has already kindly written up what is needed to run the more recent versions of the pipeline successfully with prokaryotic RNA-seq. However, somebody needs to update the pipeline documentation accordingly. Since this requires only Markdown edits, I think it is a suitable task for the Hackathon?

For some general recommendations on how to write good technical documentation, see the website of the Diátaxis framework.

@d4straub
Copy link
Contributor

d4straub commented Jan 24, 2025

As of now (3.18.0) the documentation didnt change much.
Also, I couldn't run the pipeline following these suggestions recently. I was using data for E. coli strain BW25113 (4522 genes, 4419 CDS, 121 exon) of the NCBI RefSeq annotation from https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000750555.1/.

The following seems important:

  • only "exon" are measured, therefore all features of interest need to have an "exon"
  • there may not be any special signs or maybe even not space in column number 2 (and often is)
  • gene_id are neccessary for each "exon" entry

The simplest solution I found for now was using the annotation in gff format and convert & simplify with gffread. Additionally gene_id have to be added then. This can be achieved as below.

gffread genomic.gff --keep-exon-attrs -F -T --force-exons | sed '/gene_id/!s/transcript_id\([^;]*\)/&; \0/;/transcript_id.*transcript_id/s/transcript_id/gene_id/' > genomic_gffread_force-exons_geneid.gtf

A similar discussion of the problem can be found in slack https://nfcore.slack.com/archives/CE8SSJV3N/p1679577061847429?thread_ts=1677835193.447669&cid=CE8SSJV3N

This solution should be applicable broadly, but I am not sure, but if that is indeed helpful then it might make sense to add that info to the docs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
No open projects
Status: Todo
Status: To do
Development

No branches or pull requests

3 participants