DSC current technical summary and some engineering aims #3

Closed
gaow opened this issue Oct 1, 2020 · 2 comments

gaow commented Oct 1, 2020

The document below summarizes the main components of the current DSC software design, and where I'd like to see improvements first.

Introduction

DSC: Dynamic statistical comparisons

Goal

  • Compare the performance of different computational methods for a specific problem

Typical workflow

  • Generate data → Run analysis → Performance evaluation

Challenge

  • Involves many pipelines with combinations of parameters
  • Evaluate & re-evaluate & re-re-evaluate ...

DSC implementation

Simple yaml-like syntax powered by a workflow system

In this document I'll show a very simple toy example, focusing on discussion relevant to the use of SoS.

Example

A simple DSC script: test.dsc

normal: R(x <- rnorm(n,mean = mu,sd = 1))
  mu: 0
  n: 100, 200
  $data: x
  $true_mean: mu
mean: R(y <- mean(x))
  x: $data
  $est_mean: y
sq_err: R(e <- (x - y)^2)
  x: $est_mean
  y: $true_mean
  $error: e

A simple DSC script: test.dsc (cont'd)

DSC:
  define:
    simulate: normal
    analyze: mean
    score: sq_err
  run: simulate * analyze * score

Run the benchmark

pip install dsc -U # if you have not installed DSC
dsc test.dsc --replicate 5
INFO: DSC script exported to test.html
INFO: Constructing DSC from test.dsc ...
INFO: Building DSC database ...
[#######] 7 steps processed (34 jobs completed)
INFO: DSC complete!
INFO: Elapsed time 7.481 seconds.

Get benchmark results

result <- dscrutils::dscquery('test', 
    targets = c('simulate', 'simulate.n', 'analyze', 
                'score', 'score.error'), verbose=F)
print(head(result))
  DSC simulate simulate.n analyze  score  score.error
1   1   normal        100    mean sq_err 1.185646e-02
2   1   normal        200    mean sq_err 1.263066e-03
3   2   normal        100    mean sq_err 9.423768e-04
4   2   normal        200    mean sq_err 5.501969e-07
5   3   normal        100    mean sq_err 1.217838e-04
6   3   normal        200    mean sq_err 2.245624e-04

Explore the results

aggregate(score.error ~ simulate.n + analyze + score, 
          result, mean)
  simulate.n analyze  score  score.error
1        100    mean sq_err 0.0046476927
2        200    mean sq_err 0.0004343789

SoS under the hood

Run DSC via SoS

The dsc program generates two SoS workflows:

  • A prepare workflow that generates the benchmark meta-data files
    • Essentially runs simple Python scripts in SoS
    • SoS helps skip reruns when the context has not changed
  • A run workflow that runs the benchmark
    • "dynamic" input and output info comes from the meta-data
    • for_each loops correspond to the ordering of parameters (sketched below)

DSC runs these workflows using the execute_workflow() function.
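
To make the for_each point above concrete, here is a rough illustration in plain Python (not DSC's actual generated SoS code) of the substeps the run workflow loops over for the normal module: the Cartesian product of its parameter values.

# Illustration only, not DSC's generated code: the substeps that the run
# workflow's for_each loop iterates over for the `normal` module.
from itertools import product

params = {'mu': [0], 'n': [100, 200]}   # parameter values from test.dsc
substeps = [dict(zip(params, combo)) for combo in product(*params.values())]
print(substeps)   # [{'mu': 0, 'n': 100}, {'mu': 0, 'n': 200}]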

Obtain SoS scripts from DSC script

dsc test.dsc --debug
ls .sos/*.sos
.sos/test_prepare.sos  .sos/test_run.sos

These scripts are uploaded to the GitHub repo gaow/random-nbs. Let's focus on test_run.sos for now and discuss its design. What can be done to improve it, or should we revamp it?

The use of dynamic targets and step dependencies is characteristic of the current implementation. They are related to the DSC meta-data design (see below).

DSC meta-data design

DSC outputs: meta-data

cd test && ls *.*
test.conf.mpk  test.db  test.map.mpk

There are 3 files:

  • map.mpk: mapping between a substep's hash and a filename
  • conf.mpk: input and output file names for workflow substeps
  • db: meta-data for the dscquery() function (irrelevant to SoS for now)

DSC outputs: benchmark data

find . -name "*.rds"
./mean/normal_1_mean_1.rds
./mean/normal_2_mean_1.rds
./normal/normal_2.rds
./normal/normal_1.rds
./sq_err/normal_2_mean_1_sq_err_1.rds
./sq_err/normal_1_mean_1_sq_err_1.rds

These are the outputs of each workflow substep. Notice that here I use a smaller benchmark by not running dsc with --replicate 5.

Structure of map.mpk

The basic idea is to first represent each substep as

  • substep:HASH:upstream:HASH:upstream:HASH

where HASH encodes everything for a "module" except the contents of the module script, e.g.:

normal: R(x <- rnorm(n,mean = mu,sd = 1)) # not coded into HASH
  mu: 0 # coded into HASH
  n: 100, 200 # each will be a substep, coded into HASH
  $data: x # coded into HASH
  $true_mean: mu # coded into HASH

Then map them to nicer-looking filenames with numbers as indices.
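
A minimal sketch of this idea (not DSC's actual hashing code; the exact hash inputs here are an assumption): hash each module instance's parameters, then assign each distinct hash a numbered filename.

# Sketch only: hash each module instance's parameters (script contents
# excluded) and map every distinct hash to a short numbered filename.
import hashlib

def substep_hash(params):
    # deterministic 8-character digest of parameter names and values
    text = ';'.join(f'{k}={v}' for k, v in sorted(params.items()))
    return hashlib.md5(text.encode()).hexdigest()[:8]

instances = [{'mu': 0, 'n': 100, '$data': 'x', '$true_mean': 'mu'},
             {'mu': 0, 'n': 200, '$data': 'x', '$true_mean': 'mu'}]
mapping = {}
for index, params in enumerate(instances, start=1):
    mapping[f'normal:{substep_hash(params)}'] = f'normal/normal_{index}.rds'
print(mapping)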

Structure of map.mpk (cont'd)

import msgpack, yaml; print(yaml.dump(msgpack.unpackb(open('test/test.map.mpk', 'rb').read(), encoding = 'utf-8')))
__base_ids__:
  normal: {normal: 2}
  normal:mean: {mean: 1, normal: 2}
  normal:mean:sq_err: {mean: 1, normal: 2, sq_err: 1}
mean:e3f9ad83:normal:3fce637f: mean/normal_1_mean_1.rds
mean:e3f9ad83:normal:6702ef96: mean/normal_2_mean_1.rds
normal:3fce637f: normal/normal_1.rds
normal:6702ef96: normal/normal_2.rds
sq_err:cd547d28:normal:3fce637f:mean:e3f9ad83:normal:3fce637f: sq_err/normal_1_mean_1_sq_err_1.rds
sq_err:cd547d28:normal:6702ef96:mean:e3f9ad83:normal:6702ef96: sq_err/normal_2_mean_1_sq_err_1.rds

Each HASH is a combination of all parameters (parameter names and values) of a "module instance", e.g. normal:6702ef96 may refer to the normal module with mu=0, n=100.

Structure of map.mpk (cont'd)

File dependencies can be figured out from the HASHes in map.mpk, e.g.:

sq_err:cd547d28:normal:3fce637f:mean:e3f9ad83:normal:3fce637f: 
        sq_err/normal_1_mean_1_sq_err_1.rds

tells us that the file sq_err/normal_1_mean_1_sq_err_1.rds depends on:

  • normal:3fce637f which is normal/normal_1.rds
  • mean:e3f9ad83:normal:3fce637f which is mean/normal_1_mean_1.rds
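
For instance, a small sketch that recovers these dependencies from the keys alone (it assumes every upstream key appears as a suffix of the downstream key, which holds in the example above):

# Sketch: recover file dependencies from map.mpk keys, assuming every
# upstream substep key is a suffix of the downstream substep's key.
keys = {
    'normal:3fce637f': 'normal/normal_1.rds',
    'mean:e3f9ad83:normal:3fce637f': 'mean/normal_1_mean_1.rds',
    'sq_err:cd547d28:normal:3fce637f:mean:e3f9ad83:normal:3fce637f':
        'sq_err/normal_1_mean_1_sq_err_1.rds',
}

target = 'sq_err:cd547d28:normal:3fce637f:mean:e3f9ad83:normal:3fce637f'
deps = [f for k, f in keys.items() if k != target and target.endswith(k)]
print(deps)  # ['normal/normal_1.rds', 'mean/normal_1_mean_1.rds']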

Structure of conf.mpk

conf.mpk saves the step dependencies, inputs, and outputs for every subworkflow.

print(yaml.dump(msgpack.unpackb(open('test/test.conf.mpk', 'rb').read(), encoding = 'utf-8')))
'1':
  mean:
    depends:
    - [normal, 1]
    input: [test/normal/normal_1.rds, test/normal/normal_2.rds]
    output: [test/mean/normal_1_mean_1.rds, test/mean/normal_2_mean_1.rds]
  normal:
    depends: []
    input: []
    output: [test/normal/normal_1.rds, test/normal/normal_2.rds]
  sq_err:
    depends:
    - [normal, 1]
    - [mean, 1]
    input: [test/normal/normal_1.rds, test/mean/normal_1_mean_1.rds, test/normal/normal_2.rds,
      test/mean/normal_2_mean_1.rds]
    output: [test/sq_err/normal_1_mean_1_sq_err_1.rds, test/sq_err/normal_2_mean_1_sq_err_1.rds]
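
To make the step-level dependency graph explicit, one could walk this structure directly (a sketch, using the same unpackb call as the dump above):

# Sketch: turn conf.mpk into an explicit list of step-level edges.
import msgpack

with open('test/test.conf.mpk', 'rb') as f:
    conf = msgpack.unpackb(f.read(), encoding='utf-8')

for workflow_id, steps in conf.items():
    for step, info in steps.items():
        for dep_step, dep_id in info['depends']:
            print(f'workflow {workflow_id}: {dep_step}[{dep_id}] -> {step}')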

Pros of current design

  • Substep output files are meaningfully named
  • The current structure works reasonably well for the dscquery() function
    • See the dscquery() example earlier in this document
    • It operates on the *.db file, an SQLite database built from the *.mpk files (see the sketch after this list)
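
A minimal sketch of the kind of table such a database could hold (this is not DSC's actual schema; table and column names here are illustrative):

# Sketch only, illustrative schema: one table per module, keyed on the
# substep's output file, so dscquery()-style lookups can join on filenames.
import sqlite3

con = sqlite3.connect(':memory:')
con.execute('CREATE TABLE normal (id INTEGER, n INTEGER, output TEXT)')
con.executemany('INSERT INTO normal VALUES (?, ?, ?)',
                [(1, 100, 'normal/normal_1.rds'),
                 (2, 200, 'normal/normal_2.rds')])
for row in con.execute('SELECT n, output FROM normal'):
    print(row)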

Cons of current design

  • Preparing these databases takes a long time for very large benchmarks
    • Although we might improve the code that builds them
  • It is difficult to merge two separate DSC runs from different people
    • Need to rebuild the filename mapping and rename numerous files
    • This seems like a constraint we cannot overcome under the current design

Future work on meta-data

Stop using "HASH to numbered filename" mapping

  • Thus no dynamic targets

Find some way to save output as {step_name}_{substep_hash} (see the sketch at the end of this section)

  • Possibly done within SoS

Figure out how to rebuild the SQLite database to work with dscquery()

  • Not sure how, if the previous step is done within SoS

The only issue left is that substep output filenames are no longer meaningful

  • But we might compensate by improving dscquery() to be able to load a particular example
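
A sketch of the proposed {step_name}_{substep_hash} naming (the exact inputs to the hash are still to be decided): because the filename is derived deterministically from the substep itself, no hash-to-number mapping has to be maintained, and runs from different people can coexist in one directory without renaming.

# Sketch of the proposed scheme: name each output directly after its
# substep hash, so no "HASH to numbered filename" mapping is needed.
import hashlib

def output_name(step_name, params):
    digest = hashlib.md5(repr(sorted(params.items())).encode()).hexdigest()[:8]
    return f'{step_name}/{step_name}_{digest}.rds'

# Anyone running the same substep with the same parameters gets the same name.
print(output_name('normal', {'mu': 0, 'n': 100}))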

DSC next steps

Some technical aims

Related to SoS

  • New data format: currently limited to RDS and pickle files
    • supporting only R and Python benchmarks
  • Robust large-scale execution & faster queries
    • involving millions of total substeps

Not related to SoS

  • Looped execution
    • e.g., A -> B -> A -> B ...
    • more a question of how to translate DSC into an SoS script
  • Benchmark sharing between different users: we have to figure this out as we redesign the database
  • There is no group_by logic in DSC. We need this logic designed for the interface, implemented in the SoS layer underneath it, and supported in queries.

Appendix

This (possibly obsolete) slide deck:

https://github.com/gaow/random-nbs/tree/master/slides/20191201_DSC_SoS

To generate the PDF file, first download this repo, then run:

./release --notebook /path/to/DSC_SoS.ipynb

gaow commented Oct 1, 2020

The code in question for us to reimplement is:

  1. The build_config_db function, which generates conf.mpk and map.mpk
  2. The ResultDB class, which takes map.mpk as input and generates the result *.db file that dscquery uses.


gaow commented Oct 3, 2020

More technical details are now at #6. This project overview ticket will no longer be updated.

gaow closed this as completed Oct 3, 2020