This REANA reproducible analysis example shows how to run Dask workflows using Coffea. The example was adapted from the Coffea Casa tutorials repository.
Making a research data analysis reproducible basically means providing "runnable recipes" that address (1) where the input data is, (2) what software was used to analyse the data, (3) which computing environments were used to run the software and (4) which computational workflow steps were taken to run the analysis. This makes it possible to instantiate the analysis on the computational cloud and run it to obtain (5) output results.
In this example, we are using a single CMS open data set file, Run2012B_SingleMu.root, which is hosted on the EOSPUBLIC XRootD server.
The analysis code consists of a single Python file called analysis.py, which connects to a Dask cluster, conducts the analysis and prints the MET (missing transverse energy) histogram.
In order to be able to rerun the analysis even several years in the future, we need to "encapsulate the current compute environment". We shall achieve this by preparing a Docker container image for our analysis steps.
This example makes use of the Coffea platform image with the specific version 0.7.22. The container image can be found on Docker Hub at docker.io/coffeateam/coffea-dask-cc7:0.7.22-py3.10-g7f049.
The analysis workflow is simple and consists of a single command: we run python analysis.py. The command then uses Dask behind the scenes to launch computations, possibly in parallel. As a user, we do not have to specify the computational graph ourselves; the Dask library takes care of dispatching the computations.
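To illustrate the idea of dispatching independent pieces of work and merging the partial results, here is a minimal sketch that uses only the Python standard library rather than Dask or Coffea. The toy values, the bin edges and the names `fill_histogram`, `merge_histograms` and `run` are hypothetical and chosen for illustration only; in the real example this scheduling is delegated to Dask.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical bin edges (in GeV) and toy MET-like values.
# These numbers are made up for illustration; they are not CMS data.
BIN_EDGES = [0, 50, 100, 150, 200]
CHUNKS = [
    [12.0, 48.5, 75.2],
    [101.3, 30.0, 199.9],
    [55.5, 160.0, 250.0],  # 250.0 lies outside the binned range and is ignored
]


def fill_histogram(values):
    """Count how many values fall into each [low, high) bin."""
    counts = [0] * (len(BIN_EDGES) - 1)
    for value in values:
        for i in range(len(counts)):
            if BIN_EDGES[i] <= value < BIN_EDGES[i + 1]:
                counts[i] += 1
                break
    return counts


def merge_histograms(partials):
    """Sum the per-chunk histograms bin by bin."""
    return [sum(bins) for bins in zip(*partials)]


def run(chunks, max_workers=2):
    """Fill one histogram per chunk in a thread pool, then merge them."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partials = list(pool.map(fill_histogram, chunks))
    return merge_histograms(partials)


if __name__ == "__main__":
    print(run(CHUNKS))
```

Each chunk is processed independently and only the small per-chunk histograms are combined at the end, which is the same map-then-merge pattern that lets a scheduler such as Dask spread the work over several workers.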
The example produces the following MET event-level histogram as an output.
There are two ways to execute this analysis example on REANA.
If you would like to simply launch this analysis example on the REANA instance at CERN and inspect its results using the web interface, please click on the following badge:
If you would like a step-by-step guide on how to use the REANA command-line client to launch this analysis example, please read on.
We start by creating a reana.yaml file describing the above analysis structure with its inputs, code, runtime environment, computational workflow steps and expected outputs:
inputs:
  files:
    - analysis.py
workflow:
  type: serial
  resources:
    dask:
      image: docker.io/coffeateam/coffea-dask-cc7:0.7.22-py3.10-g7f049
  specification:
    steps:
      - name: process
        environment: docker.io/coffeateam/coffea-dask-cc7:0.7.22-py3.10-g7f049
        commands:
          - python analysis.py
outputs:
  files:
    - histogram.png
tests:
  files:
    - tests/log-messages.feature
    - tests/workspace-files.feature
In this example we are using a simple Serial workflow engine to launch our Dask-based computations.
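Note that the reana.yaml above uses the same container image for the Dask resources and for the workflow step. The following minimal sketch checks that property on a plain Python dictionary that mirrors the specification, so no YAML parser is needed. The helper name `mismatched_steps` is hypothetical, and the idea that the two images should stay identical (so that the code in the step and the Dask workers use the same library versions) is an assumption of this sketch rather than a rule stated in this example.

```python
IMAGE = "docker.io/coffeateam/coffea-dask-cc7:0.7.22-py3.10-g7f049"

# Plain-dict mirror of the reana.yaml shown above (tests section omitted).
SPEC = {
    "inputs": {"files": ["analysis.py"]},
    "workflow": {
        "type": "serial",
        "resources": {"dask": {"image": IMAGE}},
        "specification": {
            "steps": [
                {
                    "name": "process",
                    "environment": IMAGE,
                    "commands": ["python analysis.py"],
                }
            ]
        },
    },
    "outputs": {"files": ["histogram.png"]},
}


def mismatched_steps(spec):
    """Return the names of steps whose environment differs from the Dask image.

    Assumption: the step and the Dask resources are meant to share one image.
    """
    workflow = spec["workflow"]
    dask_image = workflow["resources"]["dask"]["image"]
    steps = workflow["specification"]["steps"]
    return [step["name"] for step in steps if step["environment"] != dask_image]


if __name__ == "__main__":
    problems = mismatched_steps(SPEC)
    if problems:
        print("Steps using a different image:", ", ".join(problems))
    else:
        print("All steps use the Dask image")
```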
We can now install the REANA command-line client, run the analysis and download the resulting plots:
$ # create new virtual environment
$ virtualenv ~/.virtualenvs/reana
$ source ~/.virtualenvs/reana/bin/activate
$ # install REANA client
$ pip install reana-client
$ # connect to some REANA cloud instance
$ export REANA_SERVER_URL=https://reana.cern.ch/
$ export REANA_ACCESS_TOKEN=XXXXXXX
$ # create new workflow
$ reana-client create -n myanalysis
$ export REANA_WORKON=myanalysis
$ # upload input code, data and workflow to the workspace
$ reana-client upload
$ # start computational workflow
$ reana-client start
$ # ... should be finished in about 5 minutes
$ reana-client status
$ # list workspace files
$ reana-client ls
$ # download output results
$ reana-client download
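The session above relies on the REANA_SERVER_URL and REANA_ACCESS_TOKEN environment variables being exported before any reana-client command is run. As a small optional pre-flight check, the following sketch reports which of them are missing. The function name `missing_settings` is hypothetical and is not part of reana-client.

```python
import os

# Variables exported in the shell session above before creating the workflow.
REQUIRED_VARIABLES = ("REANA_SERVER_URL", "REANA_ACCESS_TOKEN")


def missing_settings(environ):
    """Return the required variable names that are unset or empty."""
    return [name for name in REQUIRED_VARIABLES if not environ.get(name)]


if __name__ == "__main__":
    missing = missing_settings(os.environ)
    if missing:
        print("Please export: " + ", ".join(missing))
    else:
        print("REANA client settings found")
```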
Please see the REANA-Client documentation for a more detailed explanation of typical reana-client usage scenarios.