diff --git a/docs/conda.rst b/docs/conda.rst index edd5d8cd..1cf44f84 100644 --- a/docs/conda.rst +++ b/docs/conda.rst @@ -128,16 +128,19 @@ dependency tree and come up with a solution that works to satisfy the entire set of specified requirements. We chose to split the conda environments in two: the **main** environment and the **R** -environment (see :ref:`conda-design-decisons`). These environments are +environment (see :ref:`conda-design-decisions`). These environments are described by both "strict" and "loose" files. By default we use the "strict" version, which pins all versions of all packages exactly. This is preferred wherever possible. However we also provide a "loose" version that is not specific about versions. The following table describes these files: ++----------------+--------------------------------+----------------------------------+ | strict version | loose version | used for | +================+================================+==================================+ | ``env.yml`` | ``include/requirements.txt`` | Main Snakefiles | ++----------------+--------------------------------+----------------------------------+ | ``env-r.yaml`` | ``include/requirements-r.txt`` | Downstream RNA-seq analysis in R | ++----------------+--------------------------------+----------------------------------+ When deploying new instances, use the ``--build-envs`` argument which will use the strict version. Or use the following commands in a deployed directory: diff --git a/docs/config.rst b/docs/config.rst index 4a7e41c8..649a3cab 100644 --- a/docs/config.rst +++ b/docs/config.rst @@ -2,8 +2,8 @@ .. 
_config: -Configuration details -===================== +Configuration +============= General configuration ~~~~~~~~~~~~~~~~~~~~~ diff --git a/docs/getting-started.rst b/docs/getting-started.rst index ada4d53f..e565493d 100644 --- a/docs/getting-started.rst +++ b/docs/getting-started.rst @@ -3,22 +3,14 @@ Getting started =============== -The main prerequisite for `lcdb-wf` is `conda -_`, with the `bioconda -`_. channel set up and the `mamba -`_ drop-in replacement for conda. +The main prerequisite for `lcdb-wf` is `conda _`, with the `bioconda `_. channel set up and the `mamba `_ drop-in replacement for conda installed. If this is new to you, please see :ref:`conda-envs`. .. note:: - `lcdb-wf` is tested and heavily used on Linux. - - It is likely to work on macOS as long as all relevant conda packages are - available for macOS -- though this is not tested. - - It will **not** work on Windows due to a general lack of support of Windows - in bioinformatics tools. + `lcdb-wf` is tested and heavily used on Linux. It is only supported on + Linux. .. _setup-proj: @@ -27,7 +19,7 @@ Setting up a project The general steps to use lcdb-wf in a new project are: -1. **Deploy:** download and run ``deploy.py`` +1. **Deploy:** download and run ``deploy.py`` to copy files into a project directory 2. **Configure:** set up samples table for experiments and edit configuration file 3. **Run:** activate environment and run the Snakemake file either locally or on a cluster @@ -35,13 +27,16 @@ The general steps to use lcdb-wf in a new project are: 1. Deploying lcdb-wf -------------------- +Using `lcdb-wf` starts with copying files to a project directory, or +"deploying". Unlike other tools you may have used, `lcdb-wf` is not actually installed per se. Rather, it is "deployed" by copying over relevant files from the `lcdb-wf` repository to your project directory. 
This includes Snakefiles, config files, and other infrastructure required to run, and excludes files like these docs -and testing files that are not necessary for an actual project. The reason to -use this script is so you end up with a cleaner project directory. +and testing files that are not necessary for an actual project. The reason +to use this script is so you end up with a cleaner project directory, compared +to cloning the repo directly. This script also writes a file to the destination called ``.lcdb-wf-deployment.json``. It stores the timestamp and details about what @@ -53,8 +48,8 @@ There are a few ways of doing this. Option 1: Download and run the deployment script ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -Note that you will not be able to run tests with this method, but it is likely -the most convenient method. +This is the most convenient method, although it does not allow running tests +locally. .. code-block:: bash @@ -63,7 +58,8 @@ the most convenient method. Run ``python deploy.py -h`` to see help. Be sure to use the ``--staging`` and ``--branch=$BRANCH`` arguments when using this method, which will clone the -repository to a location of your choosing. Once you deploy you can remove it. For example: +repository to a location of your choosing. Once you deploy you can remove the +script. For example: .. code-block:: bash @@ -78,6 +74,12 @@ repository to a location of your choosing. Once you deploy you can remove it. Fo # You can clean up the cloned copy if you want: # rm -rf /tmp/lcdb-wf-tmp +This will clone the full git repo to ``/tmp/lcdb-wf-tmp``, check out the master +branch (or whatever branch ``$BRANCH`` is set to), copy the files required for +an RNA-seq project over to ``analysis/project``, build the main conda +environment and the R environment, save the ``.lcdb-wf-deployment.json`` file +there, and then delete the temporary repo.
+ Option 2: Clone repo manually ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Clone a repo using git and check out the branch. Use this method for running @@ -140,9 +142,9 @@ and run the following: If all goes well, this should print a list of jobs to be run. -You can run locally, but this is NOT recommended. To run locally, choose the -number of CPUs you want to use with the ``-j`` argument as is standard for -Snakemake. +You can run locally, but this is NOT recommended for a typical RNA-seq +project. To run locally, choose the number of CPUs you want to use with the +``-j`` argument as is standard for Snakemake. .. warning:: @@ -157,18 +159,35 @@ Snakemake. # run locally (not recommended) snakemake --use-conda -j 8 -The recommended way is to run on a cluster. On NIH's Biowulf cluster, the way -to do this is to submit the wrapper script as a batch job: +The recommended way is to run on a cluster. + +To run on a cluster, you will need a `Snakemake profile +`_ for +your cluster that translates generic resource requirements into arguments for +your cluster's batch system. + +On NIH's Biowulf cluster, the profile can be found at +https://github.com/NIH-HPC/snakemake_profile. If you are not already using this for other Snakemake workflows, you can set it up the first time like this: + +1. Clone the profile to a location of your choosing, maybe + ``~/snakemake_profile`` +2. Set the environment variable ``LCDBWF_SNAKEMAKE_PROFILE``, perhaps in your + ``~/.bashrc`` file. + +Then back in your deployed and configured project, submit the wrapper script as +a batch job: .. code-block:: bash sbatch ../../include/WRAPPER_SLURM -and then monitor the various jobs that will be submitted on your behalf. See +This will submit Snakemake as a batch job, use the profile to translate +resources to cluster arguments and set default command-line arguments, and +submit the various jobs created by Snakemake to the cluster on your behalf. See :ref:`cluster` for more details on this.
-Other clusters will need different configuration, but everything is standard -Snakemake. The Snakemake documentation on `cluster execution +Other clusters will need different configuration, but everything in `lcdb-wf` +is standard Snakemake. The Snakemake documentation on `cluster execution `_ and `cloud execution `_ can be diff --git a/docs/index.rst b/docs/index.rst index 8062f98b..064f30ab 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -17,13 +17,18 @@ Non-model organism? Custom gene annotations? Complicated regression models? Unconventional command-line arguments to tools? New tools to add to the workflow? No problem. -Tested automatically --------------------- -Every change to the code on GitHub triggers an automated test, the results of -which you can find at https://circleci.com/gh/lcdb/lcdb-wf. Each test sets the -system up from scratch, including installing all software, downloading example -data, and running everything up through the final results. This guarantees that -you can set up and test the code yourself. +Extensive downstream RNA-seq +---------------------------- +A comprehensive RMarkdown template, along with a custom R package, enables +sophisticated RNA-seq analysis that supports complex experimental designs and +many contrasts. + +Extensive exploration of ChIP-seq peaks +---------------------------------------- +The ChIP-seq configuration supports multiple peak-callers as well as calling +peaks with many different parameter sets for each caller. Combined with +visualization in track hubs (see below), this can identify the optimal +parameters for a given experiment. Track hubs ---------- @@ -44,11 +49,10 @@ a site to get lots of genomes you can use for running `fastq_screen`, and easily include arbitrary other genomes. They can then be automatically included in RNA-seq and ChIP-seq workflows. -This system is designed to allow customization as the config file -can be used to include arbitrary genomes whether local or on the web.
-The `references` workflow need only be run once for all these genomes -to be created, with the `references_dir` being used as a centralized -repository that can be then used with all other workflows. +Arbitrary genomes can be used, whether local (e.g., customized with additional +genetic constructs) or on the web. The `references` workflow need only be run +once for all these genomes to be created, with the `references_dir` being used +as a centralized repository that can then be used with all other workflows. Integration with external data and figure-making ------------------------------------------------ @@ -59,6 +63,15 @@ If an upstream file changes (e.g., gene annotation), all dependent downstream jobs -- including figures -- will be updated so you can ensure that even complex analyses stay correct and up-to-date. +Tested automatically +-------------------- +Every change to the code on GitHub triggers an automated test, the results of +which you can find at https://circleci.com/gh/lcdb/lcdb-wf. Each test sets the +system up from scratch, including installing all software, downloading example +data, and running everything up through the final results. This guarantees that +you can set up and test the code yourself. + + All the advantages of Snakemake ------------------------------- @@ -78,7 +91,9 @@ Only run the required jobs ~~~~~~~~~~~~~~~~~~~~~~~~~~ New gene annotation? Snakemake tracks dependencies, so it will detect that the annotations changed. Only jobs that depend on that file at some point in their -dependency chain will be re-run and the independent files are untouched. +dependency chain will be re-run and the independent files are untouched. Adding +a new sample will leave untouched any output from samples that have already +run.
Parallelization ~~~~~~~~~~~~~~~ diff --git a/docs/tests.rst b/docs/tests.rst index 7080b564..9b601a7d 100644 --- a/docs/tests.rst +++ b/docs/tests.rst @@ -33,12 +33,7 @@ This assumes you have set up the `bioconda channel We **highly recommend** using conda for isolating projects and for analysis reproducibility. If you are unfamiliar with conda, we provide a more detailed look -at: - -.. toctree:: - :maxdepth: 2 - - conda +at :ref:`conda-envs`. Activate the main env @@ -186,13 +181,3 @@ Exhaustive tests The file ``.circleci/config.yml`` configures all of the tests that are run on CircleCI. There's a lot of configuration happening there, but look for the entries that have ``./run_test.sh`` in them to see the commands that are run. - -Next steps ----------- - -Now that you have tested your installation of ``lcdb-wf`` you can learn about the -different workflows implemented here at the :ref:`workflows` page and see details -on configuration at :ref:`config`, before getting started on your analysis. - -In addition, :ref:`setup-proj` explains the process of deploying ``lcdb-wf`` -to a project directory. diff --git a/docs/toc.rst b/docs/toc.rst index 7ac85132..1180c8cd 100644 --- a/docs/toc.rst +++ b/docs/toc.rst @@ -2,12 +2,11 @@ Table of Contents ================= .. toctree:: - :maxdepth: 2 + :maxdepth: 3 index getting-started guide - tests workflows config references @@ -15,6 +14,8 @@ Table of Contents downstream-rnaseq chipseq integrative + conda + tests faqs changelog developers diff --git a/docs/workflows.rst b/docs/workflows.rst index 04c07b61..3ab1ec2d 100644 --- a/docs/workflows.rst +++ b/docs/workflows.rst @@ -42,6 +42,34 @@ Each workflow is driven by a ``Snakefile`` and is configured by plain text `_ format files (see :ref:`config` for much more on this). +Features common to workflows +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +In this section, we will take a higher-level look at the features common to +the primary analysis workflows. 
+ + - The ``lib`` module is imported in each Snakefile, allowing various helper + functions to be used. + +- The config file is hard-coded to be ``config/config.yaml`` by default, but + a custom config can be specified at the command-line, using ``snakemake + --configfile ``. + +- The config file is loaded using ``lib.common.load_config``. This function + resolves various paths (especially the references config section) and checks + to see if the config is well-formatted. + +- The ``c`` object: To make it easier to work with the config, a `SeqConfig` + object is created. It needs that parsed config file as well as the patterns + file (see :ref:`patterns-and-targets` for more on this). The act of creating + this object reads the sample table, fills in the patterns with sample names, + creates a reference dictionary (see ``common.references_dict``) for easy + access to reference files, and for ChIP-seq, also fills in the filenames for + the configured peak-calling runs. This object, called ``c`` for convenience, + can be accessed to get all sorts of information -- ``c.sampletable``, + ``c.config``, ``c.patterns``, ``c.targets``, and ``c.refdict`` are frequently + used in rules throughout the Snakefiles. + + Primary analysis workflows ~~~~~~~~~~~~~~~~~~~~~~~~~~ The primary analysis workflows are generally used for transforming raw data @@ -51,12 +79,11 @@ peaks or differentially bound chromatin regions. The primary analysis workflows are: -.. toctree:: - :maxdepth: 1 + - References + - RNA-seq + - ChIP-seq - references - rnaseq - chipseq +These are each described further in their respective sections. While the references workflow can be stand-alone, usually it is run as a by-product of running the RNA-seq or ChIP-seq workflows.
Here we will @@ -95,8 +122,8 @@ comments that say `# [TEST SETTINGS]`; you can ignore these, and see cp -r workflows/rnaseq workflows/genome1-rnaseq cp -r workflows/rnaseq workflows/genome2-rnaseq - Now, downstream analyses can link to and utilize results from these individual - folders, while the whole project remains self-contained. + This way, downstream analyses can link to and utilize results from these + individual folders, while the whole project remains self-contained. Integrative analysis workflows ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ @@ -105,52 +132,14 @@ tie them together. The integrative analysis workflows are described in :ref:`integrative`: -.. toctree:: - :maxdepth: 2 +- Colocalization +- "External" +- Figures - integrative - -Features common to workflows -~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -In this section, we will take a higher-level look at the features common to -the primary analysis workflows. - -- There is some shared code across the multiple Snakefiles. For instance, - The directory ``../..`` is added to Python's path. This way, the ``../../lib`` - module can be found, and we can use the various helper functions there. This is - also simpler than providing a `setup.py` to install the helper functions. - -- The config file is hard-coded to be `config/config.yaml`. This allows the config file to be - in the `config` dir with other config files without having to be specified on - the command line, while also affording the user flexibility. For instance, a custom - config can be specified at the command-line, using ``snakemake - --configfile ``. - -- The config file is loaded using ``common.load_config``. This function resolves - various paths (especially the references config section) and checks to see - if the config is well-formatted. - -- To make it easier to work with the config, a `SeqConfig` object is created. It - needs that parsed config file as well as the patterns file (see - :ref:`patterns-and-targets` for more on this). 
The act of creating this object - reads the sample table, fills in the patterns with sample names, creates - a reference dictionary (see ``common.references_dict``) for easy access to - reference files, and for ChIP-seq, also fills in the filenames for the - configured peak-calling runs. This object, called ``c`` for convenience, can be - accessed to get all sort of information -- ``c.sampletable``, ``c.config``, - ``c.patterns``, ``c.targets``, and ``c.refdict`` are frequently used in rules - throughout the Snakefiles. +These are each described in more detail in their respective sections. Next Steps ~~~~~~~~~~ -Next we look at :ref:`config` for details on how to configure specific workflows, -before going into the implemented workflows: - -- Primary analysis workflows - - :ref:`references` - - :ref:`rnaseq` - - :ref:`chipseq` - -- :ref:`integrative` - +Next we look at :ref:`config` for details on how to configure specific +workflows.