[WIP] config as PR for discussion #10

Open: wants to merge 1 commit into master
86 changes: 86 additions & 0 deletions test/config.yaml
@@ -0,0 +1,86 @@
#### General Settings ####
# location of system level settings
settings:
title: My Very Cool Project
  author: Bob

# path like settings
data: /data/bob/original_data
Contributor Author:

Is this the path to fastqs, or do indexes go here as well? I feel like this is too general a setting, and that indexes, annotations, etc. can just specify the full path. And maybe we can have a regexp for fastqs.

Contributor:

Agreed, the fastq regex can go into the controlling snake file
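A minimal sketch of what that fastq regex could look like in the controlling snake file. The filename pattern here (`<sample>_R<read>.fastq.gz`) is a hypothetical convention for illustration, not something the config above defines:

```python
import re

# Hypothetical naming convention: <sample_id>_R<read>.fastq.gz
FASTQ_RE = re.compile(r'(?P<sample>[A-Za-z0-9-]+)_R(?P<read>[12])\.fastq\.gz$')

def parse_fastq(filename):
    """Extract the sample id and read number from a fastq filename."""
    m = FASTQ_RE.search(filename)
    if m is None:
        raise ValueError('not a recognized fastq name: %s' % filename)
    return m.group('sample'), int(m.group('read'))
```

Keeping the pattern in the Snakefile (rather than the config) means the config only needs the `data` directory, as suggested above.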


# conda environment names
python2: py2.7

# names to access specific envs by
env: HOME
Contributor Author:

Would env have multiple entries? I think most tools can go in the same env, with the exception of py2-only tools, which you're specifying separately here. So really we just need a py2 env and a py3 env. Actually, once snakemake has per-rule envs this will be obsolete.

Contributor:

I would definitely leave this out initially; I was just trying to think of all possible settings. We may come across some misc perl program or something that needs these envs, but that would also be handled at the rule level.


#### Experiment Level Settings ####
# experiment level settings, settings that apply to all samples
exp.settings:

# Sample information relating sample specific settings to sample ids
sampleinfo: sample_metadata.csv

# it would be nice to be able to define a setting here that applies to all
# samples, or define it per sample in the sampleinfo table in case they
# differ.
fastq_suffix: '.fastq.gz'
# Need some way to specify which annotation to use; maybe here is not the
# best place.
annotation:
Contributor Author:

How to handle cases where we want to try multiple annotations? For example, to compare a truncated gene model annotation with the full. Maybe the answer there is just symlink over the upstream files to a new analysis dir rather than increase complexity here.
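A minimal sketch of that symlink approach, assuming a flat upstream directory; the function and directory names are hypothetical:

```python
import os

def link_upstream(upstream_dir, analysis_dir, filenames):
    """Symlink selected upstream files into a fresh analysis dir so a
    different annotation can be tried without re-running upstream rules
    or adding complexity to the config."""
    os.makedirs(analysis_dir, exist_ok=True)
    for name in filenames:
        target = os.path.join(analysis_dir, name)
        if not os.path.lexists(target):
            os.symlink(os.path.join(upstream_dir, name), target)
```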

Contributor:

This is the big problem, and I have not come up with any creative solutions. For example, the iterative junctions for splicing.

I guess the easier question is: how do we want to define an annotation that is used throughout the whole analysis, and how do we quickly switch between flybase releases?

Contributor Author:

Yeah, the iterative junctions are going to get tricky. I have some ideas about that but it needs more testing.

Currently swapping flybase versions should be as easy as swapping annotation filenames. I think the bigger question is whether or not we should support multiple "flavors" of annotations in a single workflow. For now I'm leaning towards just supporting a single one to keep things simple, and having downstream rules concatenate de novo junctions onto that single annotation.

genic: /data/...
  transcript: /data/...
intergenic: /data/...
# add modeling information here
models:
formula: ~ sex + tissue + time
# tell which columns in sample table should be treated like factors
factors:
- sex
- tissue
- time
Contributor Author:

I think the annotations and models should go in workflow.rnaseq. Also, I find myself using different models. This could be implemented by having multiple models and using jinja-templated R scripts.
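A minimal sketch of templating an R script per model. In practice this would presumably use jinja2; the stdlib `string.Template` stands in here, and the R snippet (a DESeq2-style call) is purely illustrative:

```python
from string import Template

# Stand-in for a jinja template; one rendered script per model in the config.
R_TEMPLATE = Template("""\
dds <- DESeqDataSetFromMatrix(counts, coldata, design = $formula)
""")

def render_model_script(model):
    """Fill one model block from the config into the R script template."""
    return R_TEMPLATE.substitute(formula=model['formula'])
```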

Contributor:

I don't really care what we do here, mostly I just use this step to build an Rdata file for use in my analysis notebooks.

Contributor Author:

Hmm, how about just having an Rdata file as the designated endpoint, rather than trying to get fancy with script-building?

Though I've found it useful to have a first pass of MA plots and DEG counts as another kind of QC to put in reports ("did the experiment work?") before doing custom work. So maybe just supporting a single model is fine.

Anyway, I think this is an easy thing to work out later, no need to worry about it now.


#### Workflow Settings ####
# I think using a naming scheme that follows the folder structure would be useful.
# For example: if there is a workflows folder then we could define workflow
# specific settings
workflows.qc:
  # List pieces of the pipeline to run (or maybe listing pieces to skip is better)
steps_to_run:
- fastqc
- rseqc
# or could have logical operators switches to change workflow behavior
trim: True

Contributor Author (@daler, May 21, 2016):

Maybe a default config could be created that has all parts that can be run, and then they can be commented out on an experiment-by-experiment basis. That way it's easy for the user to pick and choose which parts to run, and it increases discoverability of the rules.

Contributor:

Still thinking. I think I would like a 'run' parameter for each rule, but it may not matter. What about mutually exclusive rules?

Contributor Author:

Good point about mutually exclusive rules. Maybe we should see what the set of likely-to-be-optional rules looks like, and decide then what the config should be. I'd imagine some rules shouldn't be optional (aligning; counting reads in features; sorting BAMs), so whatever config method we use should probably not allow them to be disabled.

I don't have this sort of thing in my existing pipelines so I don't have any use cases to think about -- can you give some examples of what kinds of rules would be disable-able?
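A minimal sketch of how an always-on set of rules could combine with `steps_to_run`; the rule names in `MANDATORY` are placeholders taken from the discussion above, not a decided list:

```python
# Rules that should never be disable-able (placeholder names).
MANDATORY = {'align', 'count', 'sort_bam'}

def resolve_steps(config_steps, optional_steps):
    """Combine the mandatory rules with whichever optional steps the
    config enables; reject configs that request an unknown or
    non-optional step."""
    requested = set(config_steps)
    unknown = requested - optional_steps
    if unknown:
        raise ValueError('not optional: %s' % sorted(unknown))
    return MANDATORY | requested
```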

workflows.align:
# define what software to use and optionally what version
aligner: 'tophat2=2.1.0'
Contributor Author:

wrappers can have versions, but I think currently only at the git commit level rather than at the tool level. I like having the version specified here with the rest of the config so maybe we can find a way for this to work.

Contributor:

This would be icing on top, but should be low priority. Maybe we could create a pre-parser that pulls out the version and updates the conda envs. Or maybe there is some way to hijack the wrapper system with some sort of commit-id look-up table.

Contributor Author:

OK, agreed on low priority, but those are really good ideas for how to make it work. That could be really powerful.
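A minimal sketch of the pre-parser side: splitting a `'tool=version'` config value so the version can be fed into a conda env spec. Function names are hypothetical:

```python
def parse_tool_spec(spec):
    """Split a 'tool=version' config value into (tool, version);
    version is None when the tool is unpinned."""
    tool, _, version = spec.partition('=')
    return tool, (version or None)

def conda_spec(spec):
    """Turn the config value into a conda package spec string."""
    tool, version = parse_tool_spec(spec)
    return '%s=%s' % (tool, version) if version else tool
```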

aggregated_output_dir: /data/...
report_output_dir: /data/...

workflow.rnaseq: ...

workflows.references: ...

#### Rule Specific Settings ####
# rule level settings, again with naming based on the folder structure if we
# need one
rules.align.bowtie2:
  # It would be nice to keep cluster settings with the rule settings, but I
  # can't think of a way to get this to work; we probably just need a
  # separate cluster config.
Contributor Author:

If we had a wrapper for the actual call to snakemake, we could extract this information and build a cluster config file on the fly so this info could remain here. That would be really convenient for configuration, but at the cost of extra complexity in the wrapper.

Contributor:

I think it is worth it.
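A minimal sketch of the wrapper idea: pulling per-rule `cluster:` blocks out of the merged config into the per-rule mapping a cluster config file expects. The flat `rules.<workflow>.<rule>` key layout here is an assumption based on the scheme in this file:

```python
def extract_cluster_config(config):
    """Collect the 'cluster' block from each 'rules.*' entry, keyed by
    the final rule name, so a wrapper around the snakemake call could
    dump it to a cluster config file on the fly."""
    cluster = {}
    for key, value in config.items():
        if key.startswith('rules.') and isinstance(value, dict):
            if 'cluster' in value:
                rule_name = key.split('.')[-1]
                cluster[rule_name] = value['cluster']
    return cluster
```

The wrapper would then `json.dump` the result to a temp file and pass it via `--cluster-config`.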

cluster:
threads: 16
mem: 60g
walltime: 8:00:00
# bowtie index prefix
index: /data/...
  # Access to any parameters that need to be set
params:
# place to change the options
options: -p 16 -k 8
Contributor Author:

Have to play around with how to add this to the rules. Pretty sure I've done this before, I'll have to dig up how.


# place to change how files are named
aln_suffix: '.bt2.bam'
log_suffix: '.bt2.log'

# vim: sw=2 ts=2