DSC current technical summary and some engineering aims #3

Closed
gaow opened this issue Oct 1, 2020 · 2 comments

gaow commented Oct 1, 2020

The document below summarizes the main components of the current DSC software design, and where I'd like to see improvements first.

Introduction

DSC: Dynamic statistical comparisons

Goal

  • Compare the performance of different computational methods for a specific problem

Typical workflow

  • Generate data → Run analysis → Performance evaluation

Challenge

  • Involves many pipelines with combinations of parameters
  • Evaluate & re-evaluate & re-re-evaluate ...

DSC implementation

Simple yaml-like syntax powered by a workflow system

In this document I'll show a very simple toy example, focusing on discussion relevant to the use of SoS.

Example

A simple DSC script: test.dsc

normal: R(x <- rnorm(n,mean = mu,sd = 1))
  mu: 0
  n: 100, 200
  $data: x
  $true_mean: mu
mean: R(y <- mean(x))
  x: $data
  $est_mean: y
sq_err: R(e <- (x - y)^2)
  x: $est_mean
  y: $true_mean
  $error: e

A simple DSC script: test.dsc (cont'd)

DSC:
  define:
    simulate: normal
    analyze: mean
    score: sq_err
  run: simulate * analyze * score

Run the benchmark

pip install dsc -U # if you have not installed DSC
dsc test.dsc --replicate 5
INFO: DSC script exported to test.html
INFO: Constructing DSC from test.dsc ...
INFO: Building DSC database ...
[#######] 7 steps processed (34 jobs completed)
INFO: DSC complete!
INFO: Elapsed time 7.481 seconds.

Get benchmark results

result <- dscrutils::dscquery('test', 
    targets = c('simulate', 'simulate.n', 'analyze', 
                'score', 'score.error'), verbose=F)
print(head(result))
  DSC simulate simulate.n analyze  score  score.error
1   1   normal        100    mean sq_err 1.185646e-02
2   1   normal        200    mean sq_err 1.263066e-03
3   2   normal        100    mean sq_err 9.423768e-04
4   2   normal        200    mean sq_err 5.501969e-07
5   3   normal        100    mean sq_err 1.217838e-04
6   3   normal        200    mean sq_err 2.245624e-04

Explore the results

aggregate(score.error ~ simulate.n + analyze + score, 
          result, mean)
  simulate.n analyze  score  score.error
1        100    mean sq_err 0.0046476927
2        200    mean sq_err 0.0004343789

SoS under the hood

Run DSC via SoS

The dsc program generates two SoS workflows:

  • A prepare workflow that generates the benchmark meta-data files
    • Essentially runs simple Python scripts in SoS
    • SoS helps skip reruns when the context has not changed
  • A run workflow that runs the benchmark
    • "dynamic" input and output info comes from the meta-data
    • for_each loops correspond to the ordering of parameters (sketched below)

DSC runs these workflows using the execute_workflow() function.
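
To make the for_each point above concrete, here is a rough illustration in plain Python (not DSC's actual generated SoS code) of the substeps the run workflow loops over for the normal module: the Cartesian product of its parameter values.

# Illustration only, not DSC's generated code: the substeps that the run
# workflow's for_each loop iterates over for the `normal` module.
from itertools import product

params = {'mu': [0], 'n': [100, 200]}   # parameter values from test.dsc
substeps = [dict(zip(params, combo)) for combo in product(*params.values())]
print(substeps)   # [{'mu': 0, 'n': 100}, {'mu': 0, 'n': 200}]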

Obtain SoS scripts from DSC script

dsc test.dsc --debug
ls .sos/*.sos
.sos/test_prepare.sos  .sos/test_run.sos

These scripts are uploaded to the GitHub repo gaow/random-nbs. Let's focus on test_run.sos for now and discuss its design. What can be done to improve it, or should we revamp it?

The use of dynamic targets and step dependencies is characteristic of the current implementation. They are related to the DSC meta-data design (see below).

DSC meta-data design

DSC outputs: meta-data

cd test && ls *.*
test.conf.mpk  test.db  test.map.mpk

There are 3 files:

  • map.mpk: mapping between a substep's hash and a filename
  • conf.mpk: input and output file names for workflow substeps
  • db: meta-data for the dscquery() function (irrelevant to SoS for now)

DSC outputs: benchmark data

find . -name "*.rds"
./mean/normal_1_mean_1.rds
./mean/normal_2_mean_1.rds
./normal/normal_2.rds
./normal/normal_1.rds
./sq_err/normal_2_mean_1_sq_err_1.rds
./sq_err/normal_1_mean_1_sq_err_1.rds

These are the outputs of each workflow substep. Notice that here I use a smaller benchmark by not running dsc with --replicate 5.

Structure of map.mpk

The basic idea is to first represent each substep as

  • substep:HASH:upstream:HASH:upstream:HASH

where HASH encodes everything for a "module" except the contents of the module script, e.g.:

normal: R(x <- rnorm(n,mean = mu,sd = 1)) # not coded into HASH
  mu: 0 # coded into HASH
  n: 100, 200 # each will be a substep, coded into HASH
  $data: x # coded into HASH
  $true_mean: mu # coded into HASH

Then map them to nicer-looking filenames with numbers as indices.
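
A minimal sketch of this idea (not DSC's actual hashing code; the exact hash inputs here are an assumption): hash each module instance's parameters, then assign each distinct hash a numbered filename.

# Sketch only: hash each module instance's parameters (script contents
# excluded) and map every distinct hash to a short numbered filename.
import hashlib

def substep_hash(params):
    # deterministic 8-character digest of parameter names and values
    text = ';'.join(f'{k}={v}' for k, v in sorted(params.items()))
    return hashlib.md5(text.encode()).hexdigest()[:8]

instances = [{'mu': 0, 'n': 100, '$data': 'x', '$true_mean': 'mu'},
             {'mu': 0, 'n': 200, '$data': 'x', '$true_mean': 'mu'}]
mapping = {}
for index, params in enumerate(instances, start=1):
    mapping[f'normal:{substep_hash(params)}'] = f'normal/normal_{index}.rds'
print(mapping)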

Structure of map.mpk (cont'd)

import msgpack, yaml; print(yaml.dump(msgpack.unpackb(open('test/test.map.mpk', 'rb').read(), encoding = 'utf-8')))
__base_ids__:
  normal: {normal: 2}
  normal:mean: {mean: 1, normal: 2}
  normal:mean:sq_err: {mean: 1, normal: 2, sq_err: 1}
mean:e3f9ad83:normal:3fce637f: mean/normal_1_mean_1.rds
mean:e3f9ad83:normal:6702ef96: mean/normal_2_mean_1.rds
normal:3fce637f: normal/normal_1.rds
normal:6702ef96: normal/normal_2.rds
sq_err:cd547d28:normal:3fce637f:mean:e3f9ad83:normal:3fce637f: sq_err/normal_1_mean_1_sq_err_1.rds
sq_err:cd547d28:normal:6702ef96:mean:e3f9ad83:normal:6702ef96: sq_err/normal_2_mean_1_sq_err_1.rds

Each HASH is a combination of all parameters (parameter names and values) of a "module instance", e.g. normal:6702ef96 may refer to the normal module with mu=0, n=100.

Structure of map.mpk (cont'd)

File dependencies can be figured out from the HASHes in map.mpk, e.g.:

sq_err:cd547d28:normal:3fce637f:mean:e3f9ad83:normal:3fce637f: 
        sq_err/normal_1_mean_1_sq_err_1.rds

tells us that the file sq_err/normal_1_mean_1_sq_err_1.rds depends on:

  • normal:3fce637f which is normal/normal_1.rds
  • mean:e3f9ad83:normal:3fce637f which is mean/normal_1_mean_1.rds
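
For instance, a small sketch that recovers these dependencies from the keys alone (it assumes every upstream key appears as a suffix of the downstream key, which holds in the example above):

# Sketch: recover file dependencies from map.mpk keys, assuming every
# upstream substep key is a suffix of the downstream substep's key.
keys = {
    'normal:3fce637f': 'normal/normal_1.rds',
    'mean:e3f9ad83:normal:3fce637f': 'mean/normal_1_mean_1.rds',
    'sq_err:cd547d28:normal:3fce637f:mean:e3f9ad83:normal:3fce637f':
        'sq_err/normal_1_mean_1_sq_err_1.rds',
}

target = 'sq_err:cd547d28:normal:3fce637f:mean:e3f9ad83:normal:3fce637f'
deps = [f for k, f in keys.items() if k != target and target.endswith(k)]
print(deps)  # ['normal/normal_1.rds', 'mean/normal_1_mean_1.rds']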

Structure of conf.mpk

conf.mpk saves the step dependencies, inputs, and outputs for every subworkflow.

print(yaml.dump(msgpack.unpackb(open('test/test.conf.mpk', 'rb').read(), encoding = 'utf-8')))
'1':
  mean:
    depends:
    - [normal, 1]
    input: [test/normal/normal_1.rds, test/normal/normal_2.rds]
    output: [test/mean/normal_1_mean_1.rds, test/mean/normal_2_mean_1.rds]
  normal:
    depends: []
    input: []
    output: [test/normal/normal_1.rds, test/normal/normal_2.rds]
  sq_err:
    depends:
    - [normal, 1]
    - [mean, 1]
    input: [test/normal/normal_1.rds, test/mean/normal_1_mean_1.rds, test/normal/normal_2.rds,
      test/mean/normal_2_mean_1.rds]
    output: [test/sq_err/normal_1_mean_1_sq_err_1.rds, test/sq_err/normal_2_mean_1_sq_err_1.rds]
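
To make the step-level dependency graph explicit, one could walk this structure directly (a sketch, using the same unpackb call as the dump above):

# Sketch: turn conf.mpk into an explicit list of step-level edges.
import msgpack

with open('test/test.conf.mpk', 'rb') as f:
    conf = msgpack.unpackb(f.read(), encoding='utf-8')

for workflow_id, steps in conf.items():
    for step, info in steps.items():
        for dep_step, dep_id in info['depends']:
            print(f'workflow {workflow_id}: {dep_step}[{dep_id}] -> {step}')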

Pros of current design

  • Substep output files are meaningfully named
  • The current structure works reasonably well for the dscquery() function
    • See the dscquery() example earlier in this document
    • It operates on the *.db file, an SQLite database built from the *.mpk files (see the sketch after this list)
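
A minimal sketch of the kind of table such a database could hold (this is not DSC's actual schema; table and column names here are illustrative):

# Sketch only, illustrative schema: one table per module, keyed on the
# substep's output file, so dscquery()-style lookups can join on filenames.
import sqlite3

con = sqlite3.connect(':memory:')
con.execute('CREATE TABLE normal (id INTEGER, n INTEGER, output TEXT)')
con.executemany('INSERT INTO normal VALUES (?, ?, ?)',
                [(1, 100, 'normal/normal_1.rds'),
                 (2, 200, 'normal/normal_2.rds')])
for row in con.execute('SELECT n, output FROM normal'):
    print(row)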

Cons of current design

  • Preparing these databases takes a long time for very large benchmarks
    • Although we might improve the code that builds them
  • It is difficult to merge two separate DSC runs from different people
    • Need to rebuild the filename mapping and rename numerous files
    • This seems like a constraint we cannot overcome under the current design

Future work on meta-data

Stop using "HASH to numbered filename" mapping

  • Thus no dynamic targets

Find some way to save output as {step_name}_{substep_hash} (see the sketch at the end of this section)

  • Possibly done within SoS

Figure out how to rebuild the SQLite database to work with dscquery()

  • Not sure how, if the previous step is done within SoS

The only issue left is that substep output filenames are no longer meaningful

  • But we might compensate by improving dscquery() to be able to load a particular example
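
A sketch of the proposed {step_name}_{substep_hash} naming (the exact inputs to the hash are still to be decided): because the filename is derived deterministically from the substep itself, no hash-to-number mapping has to be maintained, and runs from different people can coexist in one directory without renaming.

# Sketch of the proposed scheme: name each output directly after its
# substep hash, so no "HASH to numbered filename" mapping is needed.
import hashlib

def output_name(step_name, params):
    digest = hashlib.md5(repr(sorted(params.items())).encode()).hexdigest()[:8]
    return f'{step_name}/{step_name}_{digest}.rds'

# Anyone running the same substep with the same parameters gets the same name.
print(output_name('normal', {'mu': 0, 'n': 100}))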

DSC next steps

Some technical aims

Related to SoS

  • New data format: currently limited to RDS and pickle files
    • supporting only R and Python benchmarks
  • Robust large-scale execution & faster queries
    • involving millions of total substeps

Not related to SoS

  • Looped execution
    • e.g., A -> B -> A -> B ...
    • more a question of how to translate DSC into an SoS script
  • Benchmark sharing between different users: we have to figure this out as we redesign the database
  • There is no group_by logic in DSC. We need this logic designed for the interface, implemented in the SoS layer underneath it, and supported in queries.

Appendix

This (possibly obsolete) slide deck:

https://github.com/gaow/random-nbs/tree/master/slides/20191201_DSC_SoS

To generate the PDF file, first download this repo, then run:

./release --notebook /path/to/DSC_SoS.ipynb

gaow commented Oct 1, 2020

The code in question for us to reimplement is:

  1. The build_config_db function, which generates conf.mpk and map.mpk
  2. The ResultDB class, which takes map.mpk as input and generates the result *.db file that dscquery uses.


gaow commented Oct 3, 2020

More technical details are now at #6. This project overview ticket will no longer be updated.

gaow closed this as completed Oct 3, 2020