no intro on how to use the group meta yaml file and the tool does not validate yaml properly #16

Open
byb121 opened this issue Jan 18, 2019 · 1 comment

byb121 commented Jan 18, 2019

When the input sequencing files are in FASTQ format, working out how to use the YAML file is tricky. For example, paired FASTQ file names have to end with _1.fq.gz and _2.fq.gz, but this convention is not mentioned anywhere.

Our code does not validate the file properly either. A few cases:

  1. If the SM tag is missing from the YAML file, cgpmap runs without complaint, but it ignores the RG ID given in the YAML file and generates a random one.
  2. Identical FASTQ file names in the file do not trigger any error or warning.
  3. When an input file is missing from the file, cgpmap does not complain.
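
To make the three cases concrete, here is a minimal sketch of the checks I have in mind. It is not cgpmap or PCAP-core code: `check_groupinfo` and its parameters are hypothetical names, and it assumes the YAML has already been read into a sample name plus the list of file names it mentions. A YAML loader may collapse duplicate mapping keys, so `listed_files` would have to be collected before that happens.

```python
from collections import Counter


def check_groupinfo(sample, listed_files, input_files):
    """Return a list of human-readable problems; an empty list means OK.

    sample:       value of the top-level SM key, or None if it is absent
    listed_files: file names as they appear in the YAML, duplicates kept
    input_files:  file names actually passed to the mapper
    (All three parameters are hypothetical, for illustration only.)
    """
    problems = []

    # Case 1: the SM tag is missing or empty.
    if not sample:
        problems.append("SM tag is missing")

    # Case 2: the same file name is listed more than once.
    for name, count in sorted(Counter(listed_files).items()):
        if count > 1:
            problems.append(f"{name} is listed {count} times")

    # Case 3: an input file has no entry in the YAML file.
    listed = set(listed_files)
    for name in input_files:
        if name not in listed:
            problems.append(f"{name} is not described in the YAML file")

    return problems
```

Calling `check_groupinfo(None, ["a_1.fq.gz", "a_1.fq.gz"], ["a_1.fq.gz", "a_2.fq.gz"])` would report all three problems at once instead of running on silently.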

I think it would be better to have a flag specifically for single-ended FASTQs. cgpmap would assume inputs are paired-end and complain if they are neither interleaved nor paired; if the input really is single-ended, the user would need to label it explicitly. Currently it just carries on silently with its own assumptions.
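
A rough sketch of the behaviour I am suggesting, using only the `_1.fq.gz` / `_2.fq.gz` naming convention. The function `pair_fastqs` and the `single_end` flag are hypothetical and do not exist in cgpmap today. Detecting interleaved files would need a look at the read names inside the file, so that part is left out here.

```python
import re

# Matches names such as "lane1_1.fq.gz" or "lane1_2.fq" (assumed convention).
_PAIR_RE = re.compile(r"^(?P<stem>.+)_(?P<read>[12])\.fq(?:\.gz)?$")


def pair_fastqs(names, single_end=False):
    """Group FASTQ names into mate pairs, or fail loudly.

    With single_end=True every file is returned on its own, because the
    user has stated explicitly that the input is single-ended.
    """
    if single_end:
        return [(name,) for name in names]

    mates = {}      # stem -> {"1": name, "2": name}
    unpaired = []   # names that do not fit the convention or lack a mate
    for name in names:
        match = _PAIR_RE.match(name)
        if match is None:
            unpaired.append(name)
            continue
        mates.setdefault(match.group("stem"), {})[match.group("read")] = name

    pairs = []
    for reads in mates.values():
        if set(reads) == {"1", "2"}:
            pairs.append((reads["1"], reads["2"]))
        else:
            unpaired.extend(reads.values())

    if unpaired:
        raise ValueError(
            "unpaired input, use the single-end flag if intended: "
            + ", ".join(sorted(unpaired))
        )
    return pairs
```

With this, `pair_fastqs(["a_1.fq.gz", "a_2.fq.gz"])` returns one pair, while `pair_fastqs(["a_1.fq.gz"])` raises an error unless `single_end=True` is passed.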

keiranmraine (Contributor) commented Jan 30, 2019

The docs exist in the underlying tool documentation but need to be linked:

https://github.com/cancerit/PCAP-core/wiki/File-Formats-groupinfo.yaml

> For example, paired FASTQ file names have to end with _1.fq.gz and _2.fq.gz

I think this could be handled better by modifying the YAML format slightly. This was added very late; previously you had no ability to include any header information. Pushing the files into the read group records would allow explicit pairing, but work on the underlying calls to the aligner will be needed.

SM: sample
# the actual readgroups
READGRPS:
  - ID: 9
    files:
      - fq_1_00001.fq.gz
      - fq_2_00001.fq.gz
    CN: centre
    DS: Please don't use multiline
    LB: Library_id
    PI: 500
    PL: FORCED TO UPPER
    PM: HiSeq-XTen
    PU: 1234_1
  - ID: 10
    files:
      - fq_1.fq.gz
      - fq_2.fq.gz

The three issues listed are definite bugs.
