This software provides the Sanger NPG team's automation for analysing and internally archiving Illumina sequencing data on behalf of DNA Pipelines for their customers.
There are two main pipelines:

- data product and QC metric creation: `central`
- internal archival of data products, metadata, QC metrics and logs: `post_qc_review`

and the daemons which automatically start these pipelines.
Processing is performed as appropriate for the entire run, for each lane in the sequencing flowcell, or for each tagged library (within a pool on the flowcell).
With this system, all of a pipeline's jobs for its steps are submitted for execution to the LSF or wr batch/job processing system as the pipeline is initialised. As such, a submitted pipeline does not have an orchestration script or daemon running: managing the runtime dependencies of jobs within an instance of a pipeline is delegated to the batch/job processing system.
How is this done? The job representing the start point of a graph is submitted to LSF or wr in a suspended state and is resumed once all other jobs have been submitted, thus ensuring that execution starts only if all steps are successfully submitted. If an error occurs at any point during job submission, all submitted jobs, apart from the start job, are killed.
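The submission pattern can be sketched as follows. This is a minimal illustration, not the pipeline's actual Perl code; the `FakeScheduler` stands in for LSF's `bsub -H`/`bresume`/`bkill` (or the wr equivalents), and all names here are invented for the example:

```python
class FakeScheduler:
    """In-memory stand-in for LSF/wr: records submissions only."""

    def __init__(self):
        self.jobs = {}   # job id -> {"name", "suspended", "deps"}
        self.next_id = 1

    def submit(self, name, suspended=False, depends_on=()):
        job_id = self.next_id
        self.next_id += 1
        self.jobs[job_id] = {"name": name, "suspended": suspended,
                             "deps": list(depends_on)}
        return job_id

    def resume(self, job_id):
        self.jobs[job_id]["suspended"] = False

    def kill(self, job_id):
        del self.jobs[job_id]


def launch_pipeline(steps, predecessors, scheduler):
    """Submit a whole pipeline before any job runs.

    steps: step names in topological order; steps[0] is the start node.
    predecessors: dict mapping a step name to its predecessor names.

    The start job is submitted suspended; every other job is submitted
    with dependencies on its predecessors. Only when every submission
    succeeds is the start job resumed; on any error, all jobs apart
    from the start job are killed.
    """
    ids = {steps[0]: scheduler.submit(steps[0], suspended=True)}
    submitted = []
    try:
        for name in steps[1:]:
            deps = [ids[p] for p in predecessors.get(name, [])]
            job_id = scheduler.submit(name, depends_on=deps)
            ids[name] = job_id
            submitted.append(job_id)
    except Exception:
        # Roll back everything except the (still suspended) start job.
        for job_id in submitted:
            scheduler.kill(job_id)
        raise
    scheduler.resume(ids[steps[0]])  # all steps in place: begin execution
    return ids
```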
Steps of each of the pipelines, and the dependencies between the steps, are defined in JSON input files located in the `data/config_files` directory. The files follow the JSON Graph Format syntax. Individual pipeline steps are defined as graph nodes, and dependencies between them as directed graph edges. If step B should be executed after step A finishes, step B is considered to be dependent on step A.
The graph represented by the input file should be a directed acyclic graph (DAG). Each graph node should have an id, which should be unique, and a label, which is the name of the pipeline step.
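For illustration, a minimal two-step definition in JGF might look like the fragment below. The node ids and the `pipeline_start` label are invented for this example; see the files in `data/config_files` for the real definitions:

```json
{
  "graph": {
    "directed": true,
    "nodes": [
      {"id": "n1", "label": "pipeline_start"},
      {"id": "n2", "label": "p4_stage1_analysis"}
    ],
    "edges": [
      {"source": "n1", "target": "n2"}
    ]
  }
}
```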
Parallelisation of processing may be performed at different levels within the DAG: some steps are appropriate for

- per run
- per lane
- per lane and tagged library
- per tagged library

parallelisation.
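As an illustration of these levels (invented structures, not the pipeline's own code), a step declared at a given level expands into one job per entity at that level:

```python
def expand_step(step_name, level, lanes, tags_by_lane):
    """Expand one pipeline step into per-entity job definitions.

    level is one of "run", "lane" or "tag" (per lane and tagged
    library); tags_by_lane maps a lane number to its tag indices.
    """
    if level == "run":
        return [(step_name,)]                       # one job for the whole run
    if level == "lane":
        return [(step_name, lane) for lane in lanes]
    if level == "tag":
        return [(step_name, lane, tag)
                for lane in lanes for tag in tags_by_lane[lane]]
    raise ValueError(f"unknown level: {level}")
```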
JSON Graph Format (JGF) is relatively new, with little support for visualization. Convert JGF to GML (Graph Modeling Language) format using a simple script supplied with this package, `scripts/jgf2gml`.
Many graph visualization tools, for example
Cytoscape, support the GML format.
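A rough sketch of such a conversion in Python (not the bundled `scripts/jgf2gml`, which is the authoritative implementation and may handle more attributes):

```python
import json


def jgf_to_gml(jgf_text):
    """Convert a JSON Graph Format document to a minimal GML string.

    Only node ids/labels and edges are carried over.
    """
    graph = json.loads(jgf_text)["graph"]
    # GML node ids are integers; map the JGF string ids onto them.
    id_map = {n["id"]: i for i, n in enumerate(graph["nodes"])}
    lines = ["graph ["]
    if graph.get("directed"):
        lines.append("  directed 1")
    for node in graph["nodes"]:
        lines += ["  node [",
                  f"    id {id_map[node['id']]}",
                  f"    label \"{node.get('label', node['id'])}\"",
                  "  ]"]
    for edge in graph.get("edges", []):
        lines += ["  edge [",
                  f"    source {id_map[edge['source']]}",
                  f"    target {id_map[edge['target']]}",
                  "  ]"]
    lines.append("]")
    return "\n".join(lines)
```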
The processing is performed per sequencing run. Many different studies and sequencing assays for different "customers" may be performed on a single run. Unlike contemporary (2020s) sharable bioinformatics pipelines, the informatics logic is tied closely to the business logic: e.g. which aligner is required with which reference, and whether human read separation is required, is determined per indexed library within a lane of sequencing and scheduled for work in parallel.
The information required for the logic is obtained from the upstream "LIMS" via a MLWH (Multi-LIMS warehouse) database and the run folder output by the sequencing instrument.
Processes data coming from Illumina sequencing instruments. It is labeled the "central" pipeline.
The input for an instance of the pipeline is the instrument output run folder (BCL and associated files) and LIMS information which drives appropriate processing.
The key data products are aligned or unaligned CRAM files and indexes. However, per-study (a LIMS datum) pipeline configuration allows for the creation of GATK gVCF files, or the running of an external tool/pipeline, e.g. ncov2019-artic-nf.
Within this DAG there are two steps which are key in producing the main data products:

- `p4_stage1_analysis` processes data at the lane level within a flowcell/run: this includes conversion of instrument output (BCL files) to BAM format, demultiplexing of data within a lane to tagged libraries, alignment with any spiked phiX, (for some instrument types) detection of indel-inducing fluidics bubbles and marking of affected reads with a fail bit, and (for some instrument types) detection and marking of sequencing adapter.
- `seq_alignment` processes data at the tagged library, or lane and tagged library, level: this includes alignment to the target genome (or not), a naive human read filtering capability, a capability to split human target data by autosome/allosome, (for some instrument types) removal of marked adapter pre-alignment and pasting it back post-alignment (so there is no loss of instrument basecalls or quality data), duplicate marking, and creation of standard sequencing metrics files.
Archives sequencing data (CRAM files) and other related artifacts, e.g. index files and QC metrics. It is labeled the "post_qc_review" pipeline.
Log file - in the run folder (as in the current pipeline). Example:
/nfs/sf55/IL_seq_data/outgoing/path_to_runfolder/bin_npg_pipeline_central_25438_20180321-080455-2214166102.log
File with JSON serialization of definition objects - in the analysis directory. Example:
/path_to_runfolder/bin_npg_pipeline_central_25438_20180321-080455-2214166102.log.json
File with saved commands hashed by function name, LSF job id and array index -
in the analysis directory. Example:
/path_to_runfolder/Data/Intensities/BAM_basecalls_20180321-075511/bin_npg_pipeline_central_25438_20180321-080455-2214166102.log.commands4jobs.json
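As an illustrative guess at the shape of this file (not the exact schema; the keys and command lines below are placeholders):

```json
{
  "function_name": {
    "lsf_job_id": {
      "1": "<command line for array index 1>",
      "2": "<command line for array index 2>"
    }
  }
}
```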
This software relies heavily on the npg_tracking software to abstract information from the MLWH and the instrument runfolder, and to coordinate the state of the run.
This software integrates heavily with the npg_qc system, which calculates and records QC metrics for internal display so that operational teams can assess the sequencing and upstream processes.
For the data-processing-intensive steps, `p4_stage1_analysis` and `seq_alignment`, the p4 software is used to provide disk-IO-minimised processing of many informatics tools in streaming data flow DAGs.
Also, the npg_irods system is essential for the internal archival of data products.
If the same library is sequenced in different lanes of a flowcell, under certain conditions the pipeline will automatically merge all data for the library into a single end product. Data from spiked-in PhiX libraries and data not assigned to any tag (tag zero) are not merged. The following scenarios trigger the merge:
- NovaSeq Standard flowcell - a merge across all (two or four) lanes is performed.
- Any flowcell run on a NovaSeqX instrument - if multiple lanes belong to the same pool, the data from individual libraries will be merged across those lanes. Thus the output of a NovaSeqX run might contain a mixture of merged and unmerged products.
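The grouping logic can be sketched roughly as below. This is a simplified illustration, not the pipeline's actual Perl implementation; in particular, modelling spiked-in PhiX as a library id of `"phix"` is an assumption made for the example:

```python
from collections import defaultdict


def merge_groups(lane_contents, excluded_lanes=()):
    """Group (lane, tag, library) records into merged products.

    lane_contents: iterable of (lane, tag_index, library_id) tuples.
    Tag zero (unassigned reads) and spiked-in PhiX are never merged;
    entities in excluded lanes are kept as separate per-lane products.
    """
    groups = defaultdict(list)
    for lane, tag, library in lane_contents:
        mergeable = (tag != 0 and library != "phix"
                     and lane not in excluded_lanes)
        # Non-mergeable entities keep their lane in the key, so they
        # always form singleton per-lane groups.
        key = (library, tag) if mergeable else (library, tag, lane)
        groups[key].append((lane, tag))
    return list(groups.values())
```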
If the data quality in a lane is poor, the lane should be excluded from the merge. The `--process_separately_lanes` pipeline option is used to list such lanes. Usually this option is used when running the analysis pipeline. The pipeline caches the supplied lane numbers so that the archival pipeline can generate a list of data products consistent with the analysis pipeline. The same applies to the `npg_run_is_deletable` script. The cached value is retrieved only if the `--process_separately_lanes` argument was not set when any of these scripts is invoked.
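The precedence between the command-line argument and the cached value can be illustrated schematically (invented helper and file names; only the `--process_separately_lanes` option itself comes from the pipeline):

```python
import json
from pathlib import Path


def resolve_separate_lanes(cli_lanes, cache_file):
    """Return the lanes to process separately.

    An explicitly supplied command-line value always wins and refreshes
    the cache; the value cached by the analysis pipeline is used only
    as a fallback, so that the archival pipeline and the deletability
    check see a list consistent with the analysis pipeline.
    """
    if cli_lanes:  # argument was set: use it and cache it for later runs
        lanes = sorted(cli_lanes)
        Path(cache_file).write_text(json.dumps(lanes))
        return lanes
    cache = Path(cache_file)
    if cache.exists():
        return json.loads(cache.read_text())
    return []
```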