Meta info database reimplementation #6

Open
gaow opened this issue Oct 3, 2020 · 1 comment
gaow commented Oct 3, 2020

Improve DSC meta information database

Please first install DSC from the development repo:

pip install git+git://github.com/cumc/dsc -U

Problem overview

We use this toy benchmark as an example,

dsc first_investigation.dsc

There will then be two folders in the directory where you run the command:

- dsc_result
- .sos

Inside the .sos folder there are several files:

# Generated by DSC
dsc_result.cfg.pkl
dsc_result.io.meta.pkl
dsc_result.io.pkl
# Generated by SoS as it executes the DSC benchmark
step_signatures.db  
transcript.txt  
workflow_signatures.db

These pkl files generated by DSC contain information extracted from the *.dsc script currently being executed.

Inside the dsc_result folder, apart from some folders that contain intermediate results, there are two files:

# Generated by DSC
dsc_result.map.mpk 
dsc_result.db  

dsc_result.map.mpk is meant to preserve information from multiple runs of DSC (@BoPeng: in SoS terminology, a module instance is a "substep"). Every time a DSC command runs, dsc_result.map.mpk should be updated, not re-written. dsc_result.map.mpk is a key-value (dictionary) database saved in msgpack format.

dsc_result.db is in pickle format, just with an (arbitrary) db file extension. It contains information about the current DSC run.
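
Since it is a pickle under the hood, it can be inspected the same way as the other files:

import pickle
pickle.load(open('dsc_result/dsc_result.db', 'rb'))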

Relationship among these files

  • dsc_result.cfg.pkl, dsc_result.io.meta.pkl = generated_from(dsc_script)
  • dsc_result.map.mpk, dsc_result.io.pkl = generated_from(dsc_result.cfg.pkl, dsc_result.io.meta.pkl) via this function
  • dsc_result.db = generated_from(dsc_result.cfg.pkl, dsc_result.io.meta.pkl, dsc_result.map.mpk) via this class

Current task

Let's start by reimplementing dsc_result.map.mpk. The goal is to still take dsc_result.cfg.pkl and dsc_result.io.meta.pkl as input, but to efficiently update dsc_result.map.mpk, consolidating it with info from previous runs, and to generate dsc_result.io.pkl for the current run. Two requirements:

  1. It is updated, not rewritten, at each DSC run
  2. Databases from different users can easily be merged

Input data explained

.sos/dsc_result.io.meta.pkl

import pickle
pickle.load(open('.sos/dsc_result.io.meta.pkl','rb'))
{1: {'normal': ['normal', 1], 'mean': ['mean', 1], 'abs_err': ['abs_err', 1]},
 2: {'normal': ['normal', 1], 'mean': ['mean', 1], 'sq_err': ['sq_err', 2]},
 3: {'normal': ['normal', 1],
  'median': ['median', 3],
  'abs_err': ['abs_err', 3]},
 4: {'normal': ['normal', 1],
  'median': ['median', 3],
  'sq_err': ['sq_err', 4]},
 5: {'t': ['t', 5], 'mean': ['mean', 5], 'abs_err': ['abs_err', 5]},
 6: {'t': ['t', 5], 'mean': ['mean', 5], 'sq_err': ['sq_err', 6]},
 7: {'t': ['t', 5], 'median': ['median', 7], 'abs_err': ['abs_err', 7]},
 8: {'t': ['t', 5], 'median': ['median', 7], 'sq_err': ['sq_err', 8]}}
  • Each key is a pipeline ID, identifying one pipeline of the benchmark.
  • 'normal': ['normal', 1] means the normal module here is the same normal module used in pipeline 1; as you can see, the first 4 pipelines share the same normal module. This meta information tells us which modules are shared between pipelines.
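
For example, the sharing structure can be recovered by grouping pipeline IDs on these (module, origin pipeline) pairs. A minimal sketch, assuming meta is the dictionary unpickled above:

import pickle
from collections import defaultdict

meta = pickle.load(open('.sos/dsc_result.io.meta.pkl', 'rb'))

# Group pipeline IDs by each module's (name, origin pipeline) pair.
shared = defaultdict(list)
for pipeline_id, modules in meta.items():
    for name, origin in modules.items():
        shared[tuple(origin)].append(pipeline_id)

dict(shared)
# {('normal', 1): [1, 2, 3, 4], ('mean', 1): [1, 2], ('abs_err', 1): [1], ...}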

.sos/dsc_result.cfg.pkl

pickle.load(open('.sos/dsc_result.cfg.pkl','rb'))
(('normal', 1),
              {('normal:3fce637f',): {'__pipeline_id__': 1,
                '__pipeline_name__': 'a_normal+a_mean+a_abs_err',
                '__module__': 'normal',
                '__out_vars__': ['data', 'true_mean'],
                'DSC_REPLICATE': 1,
                'n': 100,
                'mu': 0},
               '__input_output___': ([], ['normal:3fce637f']),
               '__ext__': 'rds'})
...
...
 (('abs_err', 1),
              {('abs_err:0acdbf79:normal:3fce637f:mean:e3f9ad83:normal:3fce637f',
                'normal:3fce637f',
                'mean:e3f9ad83:normal:3fce637f'): {'__pipeline_id__': 1,
                '__pipeline_name__': 'a_normal+a_mean+a_abs_err',
                '__module__': 'abs_err',
                '__out_vars__': ['error']},
               '__input_output___': (['normal:3fce637f',
                 'mean:e3f9ad83:normal:3fce637f'],
                ['abs_err:0acdbf79:normal:3fce637f:mean:e3f9ad83:normal:3fce637f']),
               '__ext__': 'rds'})
...

This file contains information for each module, and it is where most of the information for updating map.mpk comes from. Take ('normal', 1) as an example:

  • ('normal', 1) means this is the normal module in the pipeline 1.
  • The key 'normal:3fce637f' is its unique ID, determined by its input, its parameters, and the MD5SUM of the module's script. Any change to the input or parameters is guaranteed to result in a different ID. This key is relevant to building the map.mpk database.
  • The value corresponding to 'normal:3fce637f' holds the parameter values. Basically, 'normal:3fce637f' is the hash of these values plus the MD5SUM of the module's script (the script itself is not included here). These values are relevant to building the db database.
  • '__input_output___': the first element is the input to this module; the second element is the output of this module.
  • '__ext__': extension of the output file from this module. In current DSC, each module outputs one file: for R it is rds and for Python it is pkl (these files contain the output variables in DSC); for Bash it is yaml (a meta file).

Now look at the more complicated ('abs_err', 1). Its key, ('abs_err:0acdbf79:normal:3fce637f:mean:e3f9ad83:normal:3fce637f', 'normal:3fce637f', 'mean:e3f9ad83:normal:3fce637f'), has two components:

  • The first component, abs_err:0acdbf79:normal:3fce637f:mean:e3f9ad83:normal:3fce637f, is the module instance ID. It is made up of the module itself plus all of its dependency modules (its input): abs_err takes input from normal and mean, and mean in turn takes input from normal.
  • The rest, normal:3fce637f and mean:e3f9ad83:normal:3fce637f, are the dependencies of abs_err:0acdbf79:normal:3fce637f:mean:e3f9ad83:normal:3fce637f.
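
In code, each record can be unpacked into its instance ID, dependencies, and parameters. A minimal sketch, assuming the unpickled object iterates as the ((module, pipeline_id), record) pairs shown above:

cfg = pickle.load(open('.sos/dsc_result.cfg.pkl', 'rb'))

for (module, pipeline_id), record in cfg:
    ext = record['__ext__']
    inputs, outputs = record['__input_output___']
    for key, value in record.items():
        if key in ('__input_output___', '__ext__'):
            continue  # bookkeeping entries, handled above
        instance_id, *depends = key  # first element: instance ID; rest: its dependencies
        params = {k: v for k, v in value.items() if not k.startswith('__')}
        print(module, pipeline_id, instance_id, depends, params, ext)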

dsc_result/dsc_result.map.mpk

This is the database we'll reimplement:

import msgpack
# encoding= was removed in msgpack >= 1.0; raw=False gives the same str decoding
msgpack.unpack(open('dsc_result/dsc_result.map.mpk','rb'), raw=False)
{'normal:3fce637f': 'normal/normal_1.rds',
 'mean:e3f9ad83:normal:3fce637f': 'mean/normal_1_mean_1.rds',
 'abs_err:0acdbf79:normal:3fce637f:mean:e3f9ad83:normal:3fce637f': 'abs_err/normal_1_mean_1_abs_err_1.rds',
 'sq_err:cd547d28:normal:3fce637f:mean:e3f9ad83:normal:3fce637f': 'sq_err/normal_1_mean_1_sq_err_1.rds',
 'median:45c94289:normal:3fce637f': 'median/normal_1_median_1.rds',
 'abs_err:0acdbf79:normal:3fce637f:median:45c94289:normal:3fce637f': 'abs_err/normal_1_median_1_abs_err_1.rds',
 'sq_err:cd547d28:normal:3fce637f:median:45c94289:normal:3fce637f': 'sq_err/normal_1_median_1_sq_err_1.rds',
 't:52a5d4d3': 't/t_1.rds',
 'mean:e3f9ad83:t:52a5d4d3': 'mean/t_1_mean_1.rds',
 'abs_err:0acdbf79:t:52a5d4d3:mean:e3f9ad83:t:52a5d4d3': 'abs_err/t_1_mean_1_abs_err_1.rds',
 'sq_err:cd547d28:t:52a5d4d3:mean:e3f9ad83:t:52a5d4d3': 'sq_err/t_1_mean_1_sq_err_1.rds',
 'median:45c94289:t:52a5d4d3': 'median/t_1_median_1.rds',
 'abs_err:0acdbf79:t:52a5d4d3:median:45c94289:t:52a5d4d3': 'abs_err/t_1_median_1_abs_err_1.rds',
 'sq_err:cd547d28:t:52a5d4d3:median:45c94289:t:52a5d4d3': 'sq_err/t_1_median_1_sq_err_1.rds',
 '__base_ids__': {'normal': {'normal': 1},
  'normal:mean': {'normal': 1, 'mean': 1},
  'normal:mean:abs_err': {'normal': 1, 'mean': 1, 'abs_err': 1},
  'normal:mean:sq_err': {'normal': 1, 'mean': 1, 'sq_err': 1},
  'normal:median': {'normal': 1, 'median': 1},
  'normal:median:abs_err': {'normal': 1, 'median': 1, 'abs_err': 1},
  'normal:median:sq_err': {'normal': 1, 'median': 1, 'sq_err': 1},
  't': {'t': 1},
  't:mean': {'t': 1, 'mean': 1},
  't:mean:abs_err': {'t': 1, 'mean': 1, 'abs_err': 1},
  't:mean:sq_err': {'t': 1, 'mean': 1, 'sq_err': 1},
  't:median': {'t': 1, 'median': 1},
  't:median:abs_err': {'t': 1, 'median': 1, 'abs_err': 1},
  't:median:sq_err': {'t': 1, 'median': 1, 'sq_err': 1}}}

The main content is very simple: in 'mean:e3f9ad83:t:52a5d4d3': 'mean/t_1_mean_1.rds', one key corresponds to one unique file name to be saved on disk. What is difficult is to efficiently figure out what that file name should be. Take 'mean:e3f9ad83:t:52a5d4d3' for example. It is a mean module taking a t module as input, so it should end up in the mean/ folder with a file name of the form t_??_mean_??, indicating that the pipeline so far has executed t followed by mean. But we want to assign the ?? numbers so that they have a stable relationship with the module: for example, t:52a5d4d3 will always correspond to t_1. The file names are then easy to read, because whenever we see t_1 we know those files all come from the same t module.
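
One way to get such stable numbers is to keep a per-module counter plus a lookup from each (module, hash) pair to its assigned number. The following is a rough sketch of the idea, not the actual DSC implementation; it assumes instance IDs always alternate module:hash segments as in the examples above:

def assign_filename(instance_id, counters, assigned, ext='rds'):
    # Split e.g. 'mean:e3f9ad83:t:52a5d4d3' into (module, hash) pairs:
    # [('mean', 'e3f9ad83'), ('t', '52a5d4d3')]
    tokens = instance_id.split(':')
    chain = list(zip(tokens[::2], tokens[1::2]))
    parts, seen = [], set()
    # Walk from the innermost dependency outwards so the name reads in
    # execution order; skip duplicates that nested IDs repeat.
    for module, h in reversed(chain):
        if (module, h) in seen:
            continue
        seen.add((module, h))
        if (module, h) not in assigned:
            counters[module] = counters.get(module, 0) + 1
            assigned[(module, h)] = counters[module]
        parts.append('%s_%d' % (module, assigned[(module, h)]))
    return '%s/%s.%s' % (chain[0][0], '_'.join(parts), ext)

counters, assigned = {}, {}
assign_filename('t:52a5d4d3', counters, assigned)                # 't/t_1.rds'
assign_filename('mean:e3f9ad83:t:52a5d4d3', counters, assigned)  # 'mean/t_1_mean_1.rds'

On the toy benchmark this reproduces the file names shown in the map above; the counters and assigned lookups would have to be persisted alongside the map so numbering stays stable across runs.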

Currently my implementation is very simple: everything is written to a single file with the contents above. When new information needs to be added at the next DSC run:

  1. The entire file dsc_result/dsc_result.map.mpk is loaded.
  2. Based on the input file dsc_result.cfg.pkl, we check which modules are not yet in the map database. If a module is already there, we return the filename it maps to; otherwise we have to figure out a unique file name for it. This is currently achieved with a __base_ids__ entry that keeps track of the numbering so far and extends it for new input to generate new file names. I'll not explain it in detail, because this design is inefficient and we should do a better job in the new implementation.

.sos/dsc_result.io.pkl

pickle.load(open('.sos/dsc_result.io.pkl','rb'))
OrderedDict([('1',
              OrderedDict([('normal',
                            OrderedDict([('input', []),
                                         ('output',
                                          ['dsc_result/normal/normal_1.rds']),
                                         ('depends', [])])),
                           ('mean',
                            OrderedDict([('input',
                                          ['dsc_result/normal/normal_1.rds']),
                                         ('output',
                                          ['dsc_result/mean/normal_1_mean_1.rds']),
                                         ('depends', [('normal', 1)])])),
                           ('abs_err',
                            OrderedDict([('input',
                                          ['dsc_result/normal/normal_1.rds',
                                           'dsc_result/mean/normal_1_mean_1.rds']),
                                         ('output',
                                          ['dsc_result/abs_err/normal_1_mean_1_abs_err_1.rds']),
                                         ('depends',
                                          [('normal', 1), ('mean', 1)])]))])),
             ('2',
              OrderedDict([('normal', ('1', 'normal')),
                           ('mean', ('1', 'mean')),
                           ('sq_err',
                            OrderedDict([('input',
                                          ['dsc_result/normal/normal_1.rds',
                                           'dsc_result/mean/normal_1_mean_1.rds']),
                                         ('output',
                                          ['dsc_result/sq_err/normal_1_mean_1_sq_err_1.rds']),
                                         ('depends',
                                          [('normal', 1), ('mean', 1)])]))])),
  • The basic format is very clean: for each pipeline and for each module, it records what input the module takes (input), which modules those inputs were generated from (depends), and what the module outputs (output).
  • There is also a shortcut:
('2',
              OrderedDict([('normal', ('1', 'normal')),

Here, in pipeline 2, the normal module is the same as the normal module in pipeline '1', so instead of writing the same information again, this shortcut points back to it.
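
A reader of this file therefore needs a small resolver that follows such references back to the full record. A minimal sketch, assuming io is the OrderedDict unpickled above:

def resolve(io, pipeline_id, module):
    entry = io[pipeline_id][module]
    # A tuple like ('1', 'normal') is a reference to the record stored
    # under an earlier pipeline; follow it until we hit a full record.
    while isinstance(entry, tuple):
        entry = io[entry[0]][entry[1]]
    return entry

io = pickle.load(open('.sos/dsc_result.io.pkl', 'rb'))
resolve(io, '2', 'normal')  # same OrderedDict as io['1']['normal']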

The information in .sos/dsc_result.io.pkl combines the pipeline and dependency info from cfg.pkl with the filenames found in map.mpk, to produce meaningful file paths.
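
Putting the pieces together, records of this shape could be assembled roughly as follows. This is a sketch, not DSC's code; it assumes cfg, meta, and map_db are the unpickled cfg.pkl, io.meta.pkl, and map.mpk contents shown above, and it omits the shortcut optimization:

from collections import OrderedDict

io = OrderedDict()
for (module, pipeline_id), record in cfg:
    inputs, outputs = record['__input_output___']
    io.setdefault(str(pipeline_id), OrderedDict())[module] = OrderedDict([
        # Translate instance IDs into the file names assigned in map.mpk.
        ('input', ['dsc_result/' + map_db[i] for i in inputs]),
        ('output', ['dsc_result/' + map_db[o] for o in outputs]),
        # depends: the (module, pipeline) origin of each input module.
        ('depends', [tuple(meta[pipeline_id][i.split(':')[0]]) for i in inputs]),
    ])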

Proposed new implementation for dsc_result/dsc_result.map.mpk

We should have a database like this:

| module | file | depends | parameters |
| --- | --- | --- | --- |
| normal:3fce637f | normal/normal_1.rds | None | ... |
| sq_err:cd547d28:t:52a5d4d3:median:45c94289:t:52a5d4d3 | sq_err/t_1_median_1_sq_err_1.rds | (t:52a5d4d3, median:45c94289:t:52a5d4d3) | ... |

where the parameters column saves all parameters from the cfg.pkl file. The file column should hold the unique, human-readable filename, generated efficiently; this is perhaps the most difficult feature to implement.

This database should support efficient row additions and deletions.
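
One candidate, purely as an illustration and not a decision, is a single-table SQLite database with the parameters column serialized via msgpack so complex values survive the round trip (the file name and schema below are hypothetical):

import sqlite3, msgpack

conn = sqlite3.connect('dsc_result/dsc_result.map.db')  # hypothetical file name
conn.execute('''CREATE TABLE IF NOT EXISTS module_map (
    module     TEXT PRIMARY KEY,  -- instance ID, e.g. 'normal:3fce637f'
    file       TEXT NOT NULL,     -- assigned name, e.g. 'normal/normal_1.rds'
    depends    TEXT,              -- comma-separated dependency IDs, NULL for roots
    parameters BLOB               -- msgpack-encoded parameter dictionary
)''')

def upsert(module, file, depends, params):
    conn.execute('INSERT OR REPLACE INTO module_map VALUES (?, ?, ?, ?)',
                 (module, file, ','.join(depends) if depends else None,
                  msgpack.packb(params)))
    conn.commit()

upsert('normal:3fce637f', 'normal/normal_1.rds', None,
       {'DSC_REPLICATE': 1, 'n': 100, 'mu': 0})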

gaow commented Oct 3, 2020

This task should be easier than it sounds ... I went to extra lengths above to explain what we have, to make sure all details are covered. But essentially this boils down to:

  1. Choosing a reliable, portable database implementation for the dsc_result.map.mpk file: efficient query, addition, and deletion; support for complex data types in the parameters column; and easy merging of multiple such databases (see the merge sketch after this list).
  2. With the new database in place, an efficient algorithm to figure out the file names in the file column.
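
For point 1, a useful consequence of content-derived instance IDs is that merging two users' databases can be a keyed union, with file numbers reassigned to avoid clashes. A minimal sketch over plain dictionaries, reusing the hypothetical assign_filename helper sketched earlier in this issue:

def merge_maps(db_a, db_b, counters, assigned):
    # Equal instance IDs denote identical module instances, so a keyed
    # union is safe; file numbers were assigned per user and may clash,
    # so we renumber every entry while merging.
    merged = {}
    for db in (db_a, db_b):
        for instance_id in db:
            if instance_id == '__base_ids__':
                continue  # bookkeeping entry from the current format, not a module
            if instance_id not in merged:
                merged[instance_id] = assign_filename(instance_id, counters, assigned)
    return merged

(In practice the files already on disk would need to be renamed to match any reassigned numbers.)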

@junyanj1 I'm assigning you to this task but we can all discuss here.
