
Failure to merge multiple pipelines automatically in query when pipeline parameters are involved #145

Closed
gaow opened this issue Jun 12, 2018 · 57 comments

Comments

@gaow
Member

gaow commented Jun 12, 2018

The issue was reported by @jean997 over slack:

Let's say I have two tracks that get run, AB and AC. Let's say that C has three possible parameter values and B has only one. Right now, if I do something like

dscout <- dscquery(dsc.outdir="results",
                   targets=c("A.A_params", "B.summary", "C.C_params", "C.summary"))

I will only get results for the first parameter value of C, but if I do

dscout <- dscquery(dsc.outdir="results",
                   targets=c("A.A_params",  "C.C_params", "C.summary"))

I get all three. This makes sense, but it is awkward because I have to merge the data frames myself.

My response:

This is a result of joining tables by a common key when constructing the SQL query. But I suppose I can break it up into multiple SQL queries, join the separate tables, and then merge the outcome internally, so that you won't have to merge tables outside the query function.

@pcarbo
Member

pcarbo commented Jun 12, 2018

@gaow Can you create a small DSC example that illustrates this behaviour so I can inspect the output?

@gaow
Member Author

gaow commented Jun 12, 2018

@pcarbo Sure! That would also be the first thing I should do. I'll try to do this later today, and if it is straightforward enough to fix, I'll just fix it. Otherwise I'll upload the example here, or add it to our tests, so you can inspect it and we can discuss the desired behavior.

@pcarbo
Member

pcarbo commented Jun 12, 2018

@gaow Thanks. It would be great if I could first understand this example before you attempt to fix it.

@gaow
Member Author

gaow commented Jun 17, 2018

@pcarbo It has not been straightforward, so I could not fix it earlier this week. I'm still working on it. I'll skip the irrelevant details, describe what I've got so far, and see how to finalize it. Currently, after running separate queries I end up with this table A:

   DSC  aa.n  bb.n
0    1     1    NA
1    1     2    NA
2    1     3    NA
3    1    NA     1
4    1    NA     2
5    1    NA     3

which should further be processed to this table B:

   DSC  aa.n  bb.n
0    1     1     1
1    1     2     2
2    1     3     3

to present it more concisely. I am currently struggling with a reliable implementation to go from table A to table B. Previously I relied on SQL GROUP BY on the non-NA columns (in this case the DSC column), which caused the issue this ticket discusses. So now I need a better way to consolidate them. Do you know a good way to do this in R?
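For what it's worth, one way to express this consolidation step (a sketch in Python/pandas purely for illustration, not part of dscrutils) is to pair up the k-th non-NA value of each column within each DSC replicate. Note that this bakes in exactly the fragile assumption under discussion: the non-NA blocks must have equal length, and positional pairing must be meaningful.

```python
from functools import reduce

import numpy as np
import pandas as pd

# Table A: stacked per-module query results with complementary NA blocks.
a = pd.DataFrame({"DSC":  [1, 1, 1, 1, 1, 1],
                  "aa.n": [1, 2, 3, np.nan, np.nan, np.nan],
                  "bb.n": [np.nan, np.nan, np.nan, 1, 2, 3]})

# Within each replicate (DSC), rank the non-NA rows of each column,
# then join the columns on (DSC, rank) to pair values positionally.
parts = []
for col in ["aa.n", "bb.n"]:
    p = a.loc[a[col].notna(), ["DSC", col]].copy()
    p["rank"] = p.groupby("DSC").cumcount()
    parts.append(p)

b = reduce(lambda x, y: x.merge(y, on=["DSC", "rank"]), parts)
b = b.drop(columns="rank")  # table B: three complete rows
```

The same positional-pairing idea is what makes the merge ambiguous whenever more than one module has multiple parameter settings, which is why checking NA status alone is not enough.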

@gaow
Member Author

gaow commented Jun 17, 2018

Another question: when the two parameters differ in length (as described here), or even in nature (see below), should we be merging at all? For example:

   DSC  aa.n  bb.p
0    1     1    NA
1    1     2    NA
2    1     3    NA
3    1    NA   0.1

does it make sense to present it as:

   DSC  aa.n  bb.p
0    1     1   0.1
1    1     2    NA
2    1     3    NA

Or does it make more sense to leave it unmerged? It seems keeping it as is makes sense.

I think it is a difficult question to answer without knowing the context of the DSC. For example, to compare two fine-mapping tools I'd like to set the --max-causal parameter for both of them to, say, 1,2,3 and have the results merged to compare 1 vs 1, 2 vs 2 and 3 vs 3 -- there it makes sense to merge. But if I set the other one to 4,5,6, then it does not make sense to merge, e.g.:

   DSC  method1.max_n  method2.max_n
0    1              1              4
1    1              2              5
2    1              3              6

From an engineering point of view, I wonder if this should be a post-processing step on the dsc-query result, done at the R level, considering that whether or not to merge is context-dependent and we cannot easily tell by guessing from parameter names or values.

@stephens999
Contributor

stephens999 commented Jun 18, 2018 via email

@gaow
Member Author

gaow commented Jun 18, 2018

But let me say my understanding/suggestion is that the output of dsc_query should be one row per pipeline instance.

Yes, this is the current behavior. But sometimes it is convenient to further merge the outcome into a more compact, easier-to-use table -- if only we know how to do this appropriately! Sure, let's talk about it on Wednesday.

@stephens999
Contributor

stephens999 commented Jun 18, 2018 via email

@pcarbo
Member

pcarbo commented Jun 26, 2018

@gaow Like Matthew, I'm confused by your example -- I don't see how dsc could automatically know when to merge and when not to merge. It seems this would require an understanding of how the parameters are used in the code, which in turn requires a human-level understanding of the benchmark.

@jean997

jean997 commented Jun 26, 2018

Sorry, I am just now looking at this discussion. @gaow, @pcarbo, @stephens999: I think the problem is actually that we don't get one row per pipeline run. That would be a fine solution.

I made an example illustrating the behavior. I am attaching the dsc file example.dsc.txt and an R script get_results.R.txt that shows the behavior of dscquery (file extensions changed to txt to appease GitHub). The R file has the results of each command included.

Basically the problem is this. I run a pipeline simulate*(A, B)*summary, where method B has three possible parameters it can run with, and I run three replicates. Extracting method-A-only results, I expect a data frame with three rows -- this is observed. Extracting method-B-only results, I expect a data frame with nine rows (one row per parameter value per replicate) -- this is also observed. Extracting both, the ideal solution would also have nine rows. This might be complicated, however, and it doesn't easily extend to the case where both methods have multiple parameters. One solution described by @gaow would be to have 12 rows, with NAs for method A on nine of them and NAs for method B on three of them. However, what is observed is a data frame with three rows in which only the first parameter value for method B is extracted. Another possible solution would be to go to a wide format where the results for method B give three columns, each corresponding to a different parameter value.

@gaow
Member Author

gaow commented Jun 26, 2018

Thank you @jean997 for clarifying! Would you mind updating to the current master and checking the new behavior? Taking a step back, I realized that fixing the issue you initially reported involves reverting my patch that tries to automatically merge results from different runs -- which I believe was also originally your feature request, so reverting it may upset you. However, that should give the result you describe:

One solution described by @gaow would be to have 12 rows with NAs for method A on nine of them and NAs for method B on three of them.

Could you verify whether that is the case?

And ideally, yes we want to be able to reliably achieve this:

Another also possible solution would be to go to wide format where the results for method B give three columns each corresponding to a different parameter value.

But I'm having trouble with this, because it is logically difficult to properly determine and merge those NAs. Checking NA status is not enough, as shown by my examples above, and I tend to agree with @pcarbo that it requires human-level understanding.

Since for simple cases such as yours the merging would make sense, I'm now more inclined to add a step in the R utility to post-process the result depending on the user's needs.

Re the NA.rds problem you reported at the end of your example -- I believe that should be a small fix at the R level: @pcarbo, basically we need to not let dscquery load files when the element in the data frame is NA ...

@gaow
Member Author

gaow commented Jun 26, 2018

@jean997 I fixed the NA.rds issue and ran your example:

> library(dscrutils)
> #Try to extract both but don't end up with all parameter values for huber_mean
> dscout <- dscquery(dsc.outdir="results", 
+                    targets=c("mean.est_mean",
+                              "huber_mean.k",
+                              "huber_mean.est_mean"))
Running shell command:
dsc-query results -o /tmp/RtmpGM0tlN/file381765d2de68.csv -f --target mean.est_mean huber_mean.k huber_mean.est_mean --rds omit
INFO: Loading database ...
INFO: Running queries ...
INFO: Extraction complete!
Loading dsc-query output from CSV file.
Reading DSC outputs:
 - mean.est_mean: extracted atomic values
 - huber_mean.est_mean: extracted atomic values
> dscout
   DSC mean.est_mean huber_mean.est_mean huber_mean.k
1    1   0.008447484                  NA           NA
2    2   0.075343752                  NA           NA
3    3   0.024550986                  NA           NA
4    1            NA         0.009774250          0.5
5    2            NA         0.061133496          0.5
6    3            NA         0.025887638          0.5
7    1            NA         0.008999247          1.5
8    2            NA         0.075455984          1.5
9    3            NA         0.019175379          1.5
10   1            NA         0.008447484          5.0
11   2            NA         0.075343752          5.0
12   3            NA         0.024550986          5.0

You'll need to install the current master of dscrutils to test it out, but I'll make a new release once we resolve this issue!

@pcarbo Your previous assessment is pretty much what happens, but to get a more concrete idea you can update DSC, run Jean's example, and see for yourself. This is a very good example that Jean put together; the DSC part is just dsc example.dsc.txt. You should get my results above, and I argue that we should not try to merge them automatically, but rather provide an R function that makes the merge easy to do after the query, when the researcher deems there is a need. For example, it would be very appealing to merge the results below:

> dscout
   DSC mean.est_mean huber_mean.est_mean 
1    1   0.008447484                  NA           
2    2   0.075343752                  NA         
3    3   0.024550986                  NA           
4    1            NA         0.009774250         
5    2            NA         0.061133496          
6    3            NA         0.025887638          

@pcarbo
Member

pcarbo commented Jun 27, 2018

@gaow I confirm that the last two queries now work after updating SoS and dsc.

@jean997 Thanks for providing this example! Very helpful to have examples for ironing out issues with dsc.

@gaow
Member Author

gaow commented Jun 27, 2018

Thanks @pcarbo for confirming! To motivate my claim that we should provide some post-processing utility, I updated my post above with an example from Jean's data that we may actually want to merge. Posting it again here:

> dscout
   DSC mean.est_mean huber_mean.est_mean 
1    1   0.008447484                  NA           
2    2   0.075343752                  NA         
3    3   0.024550986                  NA           
4    1            NA         0.009774250         
5    2            NA         0.061133496          
6    3            NA         0.025887638 

would be nicer presented as (or post-processed to):

> dscout
   DSC mean.est_mean huber_mean.est_mean 
1    1   0.008447484                  0.009774250           
2    2   0.075343752                  0.061133496         
3    3   0.024550986                  0.025887638            

This was also requested by @jean997 via Slack a while ago, and I responded by implementing this automatic merging under the assumption that there is always an equal number of NAs in each missing block -- an assumption that led to the issues reported in this ticket. Now I think we all agree not to auto-merge, for the reasons discussed previously. But should we provide utility functions for post-processing like the above, if users deem it proper? @jean997, did you already have utility functions in R for this sort of merge before I responded to your Slack request?
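When each missing block contributes exactly one row per replicate, as in this example, the merge amounts to coalescing rows that share a DSC value. A minimal sketch of that post-processing step, in Python/pandas purely for illustration (values copied from the table above):

```python
import numpy as np
import pandas as pd

# Query result with complementary NA blocks, one row per replicate per block.
dscout = pd.DataFrame({
    "DSC": [1, 2, 3, 1, 2, 3],
    "mean.est_mean":       [0.008447484, 0.075343752, 0.024550986,
                            np.nan, np.nan, np.nan],
    "huber_mean.est_mean": [np.nan, np.nan, np.nan,
                            0.009774250, 0.061133496, 0.025887638]})

# GroupBy.first() keeps the first non-NA value per column, so the two
# rows sharing each DSC replicate collapse into one complete row.
merged = dscout.groupby("DSC", as_index=False).first()
```

This only gives an unambiguous answer when each column has at most one non-NA value per replicate; with multiple parameter settings per module it silently discards rows, which is the pitfall discussed earlier in the thread.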

@stephens999
Contributor

stephens999 commented Jun 27, 2018 via email

@gaow
Member Author

gaow commented Jun 27, 2018

Thank you @stephens999! The reasons I did not advertise groups too much are #130 and #134. But the big picture is that this support will be provided one way or another. At this stage the groups feature, though not mature, can be useful for simple cases.

@jean997 Indeed, for this particular example, what Matthew proposed can already be achieved dynamically at the query level:

dscout <- dscquery(dsc.outdir="results",
                   targets=c("mean_methods.est_mean", "mean_methods.k"),
                   groups="mean_methods: mean, huber_mean")

and you get:

> dscout
   DSC mean_methods mean_methods.est_mean mean_methods.k
1    1         mean           0.008447484             NA
2    2         mean           0.075343752             NA
3    3         mean           0.024550986             NA
4    1   huber_mean           0.009774250            0.5
5    2   huber_mean           0.061133496            0.5
6    3   huber_mean           0.025887638            0.5
7    1   huber_mean           0.008999247            1.5
8    2   huber_mean           0.075455984            1.5
9    3   huber_mean           0.019175379            1.5
10   1   huber_mean           0.008447484            5.0
11   2   huber_mean           0.075343752            5.0
12   3   huber_mean           0.024550986            5.0

Not exactly what you asked for, but it is a viable alternative. Of course, we might need to document it better, which I just did:

  groups: Definition of module groups. For example, ‘groups =
          c("method: mean, median", "score: abs_err, sqrt_err")’ will
          dynamically create module groups ‘method’ and ‘score’ even if
          they have not previously been defined when running DSC.

along with minor formatting adjustments. Please update dscrutils and try it out. Is that good enough for you? Does it work on your more complicated examples?

@pcarbo
Member

pcarbo commented Jun 27, 2018

@gaow @stephens999 @jean997 I agree that groups is the best way to resolve this. Here's my implementation of Matthew's suggestion:

simulate: R(dat <- rnorm(n = 1000,mean = 0,sd = 1) +
                   rbinom(n = 1000,size = 1,prob = 0.05) *
                   rexp(n = 1000,rate = 3))
  $dat: dat

unbiased_mean: R(m <- mean(x))
  x: $dat
  $est_mean: m

huber_mean: R(library(MASS); m <- huber(x,k = k)$mu)
  x: $dat
  k: 0.5, 1.5, 5
  $est_mean: m

sq_error: R(sq_error <- est^2)
  $sq_err: sq_error
  est: $est_mean

DSC:
  define: 
    mean: unbiased_mean, huber_mean
  run: simulate * mean * sq_error
  replicate: 3

For example,

dscrutils::dscquery(dsc.outdir = "example2",
  targets = c("mean.est_mean","huber_mean.k","sq_error.sq_err"))

returns the following data frame:

   DSC       mean mean.est_mean mean.k sq_error.sq_err
1    1 huber_mean      0.008447     NA      0.00007136
2    2 huber_mean      0.075344     NA      0.00567668
3    3 huber_mean      0.024551     NA      0.00060275
4    1 huber_mean      0.009774    0.5      0.00009554
5    2 huber_mean      0.061133    0.5      0.00373730
6    3 huber_mean      0.025888    0.5      0.00067017
7    1 huber_mean      0.008999    1.5      0.00008099
8    2 huber_mean      0.075456    1.5      0.00569361
9    3 huber_mean      0.019175    1.5      0.00036770
10   1 huber_mean      0.008447    5.0      0.00007136
11   2 huber_mean      0.075344    5.0      0.00567668
12   3 huber_mean      0.024551    5.0      0.00060275

@jean997

jean997 commented Jun 27, 2018

I think a group is clearly the right answer for this example. This came up for me initially in a context where groups would not have been the right choice because methods A and B produce totally different summaries.

I think @stephens999 is right that that is probably a situation where the user needs to solve their own problem, but dsc also shouldn't produce nonsensical output when it is given a poorly-thought-out query. I think a lot of users will do what I did first, before they figure out that they need to query the results in two steps and then merge. One option would be to have dscquery complain if it is asked for something too complicated. For example, I think there is an obvious way to merge if no more than one of the methods has multiple parameter values. For situations beyond that, I think it would be fine to produce an error and suggest that the user retrieve the results in separate chunks, since the NA-filled data frame is just as cumbersome to work with.
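The guard described above could be as simple as counting, per queried module, how many rows each replicate contributes, and refusing to merge when more than one module contributes multiple rows. A hypothetical sketch (function and argument names invented for illustration; Python rather than R):

```python
def check_mergeable(rows_per_replicate):
    """rows_per_replicate: dict mapping each queried module to the number
    of rows it contributes per replicate, e.g. {"A": 1, "B": 3}."""
    multi = [m for m, n in rows_per_replicate.items() if n > 1]
    if len(multi) > 1:
        # Merging would require pairing parameter settings across modules,
        # which cannot be done without human-level context.
        raise ValueError(
            "cannot merge unambiguously: modules %s each have multiple "
            "parameter settings; query them separately and merge by hand"
            % ", ".join(multi))
    return True

check_mergeable({"A": 1, "B": 3})  # fine: only B has multiple settings
```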

@pcarbo In your example above, should the first three rows have unbiased_mean in the mean column?

@pcarbo
Member

pcarbo commented Jun 27, 2018

dsc also shouldn't produce non-sensical output if it is provided with a poorly thought out query.

I dunno how you would solve this---if you use a programming language to develop poorly thought-out code, you will get poor results! This is a human problem, not a programming language problem. :)

In your example above, should the first three rows have unbiased_mean in the mean column?

@jean997 You are absolutely right. @gaow Any idea why the "mean" column only shows huber_mean?

@jean997

jean997 commented Jun 27, 2018

I just had a chance to re-run my example using the new version of dsc. For the most part I got the expected results; however, the last command gave something weird.

> dscout <- dscquery(dsc.outdir="results",
+                    targets=c("mean.est_mean",
+                              "sq_error.sq_err"))
Running shell command:
dsc-query results -o /tmp/RtmphdtMHw/file3e6f2b5cd356.csv -f --target mean.est_mean sq_error.sq_err --rds omit
INFO: Loading database ...
INFO: Running queries ...
INFO: Extraction complete!
Loading dsc-query output from CSV file.
Reading DSC outputs:
 - mean.est_mean: extracted atomic values
 - sq_error.sq_err: extracted atomic values
> dscout
   DSC mean.est_mean sq_error.sq_err
1    1   0.008447484    7.135998e-05
2    2   0.075343752    5.676681e-03
3    3   0.024550986    6.027509e-04
4    1            NA    9.553596e-05
5    2            NA    3.737304e-03
6    3            NA    6.701698e-04
7    1            NA    8.098644e-05
8    2            NA    5.693605e-03
9    3            NA    3.676952e-04
10   1            NA    7.135998e-05
11   2            NA    5.676681e-03
12   3            NA    6.027509e-04

I think the last nine rows of this data frame should not appear?

@pcarbo
Member

pcarbo commented Jun 27, 2018

@jean997 Yes, this seems like another bug or undesirable outcome.

@stephens999
Contributor

stephens999 commented Jun 30, 2018 via email

@gaow
Member Author

gaow commented Jun 30, 2018

Yes, exactly -- @stephens999 is correct. This is where row filtering is needed, via conditions:

> dscout <- dscquery(dsc.outdir="results",
+                    targets=c("mean.est_mean",
+                              "sq_error.sq_err"), conditions="mean.est_mean not null")

(need to verify)
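Equivalently, the row filtering can be done after the query, sidestepping the conditions syntax entirely. A sketch in Python/pandas purely for illustration (in R this would be na.omit or an equivalent subset; the data frame here is invented to mimic the output above):

```python
import numpy as np
import pandas as pd

# Hypothetical query result: NA in mean.est_mean marks rows that belong
# to a different branch of the pipeline.
dscout = pd.DataFrame({
    "DSC": [1, 2, 3, 1, 2, 3],
    "mean.est_mean":   [0.0084, 0.0753, 0.0246, np.nan, np.nan, np.nan],
    "sq_error.sq_err": [7.1e-5, 5.7e-3, 6.0e-4, 9.6e-5, 3.7e-3, 6.7e-4]})

# Keep only rows where mean.est_mean is present.
filtered = dscout[dscout["mean.est_mean"].notna()]
```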

But my main concern is that if we find this confusing in-house, we should perhaps do something about it. For this situation I think it might make sense to automatically add an additional column:

mean
unbiased_mean
huber_mean
...

I did not find a chance to implement this during the week, but I will try to fix this ticket in the next couple of days.

@stephens999
Contributor

stephens999 commented Jun 30, 2018 via email

@stephens999
Contributor

stephens999 commented Jun 30, 2018 via email

@gaow
Member Author

gaow commented Jun 30, 2018

Presumably that is SQL syntax?

Yes. I understand it is confusing; in particular, R has the most diverse types of missing data. Anyway, I think we should not dwell too much on the exact syntax for removing NAs, because post-processing in R, e.g. piping to dplyr, is always more transparent to R users. I even think that conditions should not be advertised at all if most users are R-based and know how to do it in R.

My suggestion would be to try to make the logic clearer...

I think automatically adding a column whenever groups appear in the query should be the default behavior, whether there is missing data or not -- that's a proper fix and a general solution, not a patch for this specific circumstance.

@gaow
Member Author

gaow commented Jun 30, 2018

Okay, the current behavior has been adjusted to:

dscrutils::dscquery(dsc.outdir = "test",  targets = c("mean.est_mean","huber_mean.k","sq_error.sq_err"))
   DSC          mean mean.est_mean mean.k sq_error.sq_err
1    1 unbiased_mean   0.008447484     NA    7.135998e-05
2    2 unbiased_mean   0.075343752     NA    5.676681e-03
3    3 unbiased_mean   0.024550986     NA    6.027509e-04
4    1    huber_mean   0.009774250    0.5    9.553596e-05
5    2    huber_mean   0.061133496    0.5    3.737304e-03
6    3    huber_mean   0.025887638    0.5    6.701698e-04
7    1    huber_mean   0.008999247    1.5    8.098644e-05
8    2    huber_mean   0.075455984    1.5    5.693605e-03
9    3    huber_mean   0.019175379    1.5    3.676952e-04
10   1    huber_mean   0.008447484    5.0    7.135998e-05
11   2    huber_mean   0.075343752    5.0    5.676681e-03
12   3    huber_mean   0.024550986    5.0    6.027509e-04

Notice here the change from huber_mean.k to mean.k. The "rule" is that as soon as the group name mean is involved, we consolidate by group for all columns involved, and add a group column (mean here) to reflect the group label.

I think making rules is not bad (in fact, I feel that I make fewer rules than much of the software I use). What's important is consistency of behavior. In principle I would not propose, or tolerate, solutions that lack generality.

dscrutils::dscquery(dsc.outdir="test", targets=c("unbiased_mean.est_mean",  "huber_mean.est_mean", "sq_error.sq_err"))
   DSC unbiased_mean.est_mean huber_mean.est_mean sq_error.sq_err
1    1            0.008447484                  NA    7.135998e-05
2    2            0.075343752                  NA    5.676681e-03
3    3            0.024550986                  NA    6.027509e-04
4    1                     NA         0.009774250    9.553596e-05
5    2                     NA         0.061133496    3.737304e-03
6    3                     NA         0.025887638    6.701698e-04
7    1                     NA         0.008999247    8.098644e-05
8    2                     NA         0.075455984    5.693605e-03
9    3                     NA         0.019175379    3.676952e-04
10   1                     NA         0.008447484    7.135998e-05
11   2                     NA         0.075343752    5.676681e-03
12   3                     NA         0.024550986    6.027509e-04

Compare above to below: again introducing mean leads to automatic grouping for all columns involved.

dscrutils::dscquery(dsc.outdir="test", targets=c("unbiased_mean.est_mean",  "huber_mean.est_mean", "mean.k", "sq_error.sq_err"))
   DSC          mean mean.est_mean mean.k sq_error.sq_err
1    1 unbiased_mean   0.008447484     NA    7.135998e-05
2    2 unbiased_mean   0.075343752     NA    5.676681e-03
3    3 unbiased_mean   0.024550986     NA    6.027509e-04
4    1    huber_mean   0.009774250    0.5    9.553596e-05
5    2    huber_mean   0.061133496    0.5    3.737304e-03
6    3    huber_mean   0.025887638    0.5    6.701698e-04
7    1    huber_mean   0.008999247    1.5    8.098644e-05
8    2    huber_mean   0.075455984    1.5    5.693605e-03
9    3    huber_mean   0.019175379    1.5    3.676952e-04
10   1    huber_mean   0.008447484    5.0    7.135998e-05
11   2    huber_mean   0.075343752    5.0    5.676681e-03
12   3    huber_mean   0.024550986    5.0    6.027509e-04
dscrutils::dscquery(dsc.outdir="test", targets=c("huber_mean.k", "sq_error.sq_err"))
   DSC huber_mean.k sq_error.sq_err
1    1           NA    7.135998e-05
2    2           NA    5.676681e-03
3    3           NA    6.027509e-04
4    1          0.5    9.553596e-05
5    2          0.5    3.737304e-03
6    3          0.5    6.701698e-04
7    1          1.5    8.098644e-05
8    2          1.5    5.693605e-03
9    3          1.5    3.676952e-04
10   1          5.0    7.135998e-05
11   2          5.0    5.676681e-03
12   3          5.0    6.027509e-04

Compare the above to the result below: mean.k adds a mean column to the result, even though k is actually unique to huber_mean. This behavior should now be consistent with what I said above.

dscrutils::dscquery(dsc.outdir="test", targets=c("mean.k", "sq_error.sq_err"))
   DSC          mean mean.k sq_error.sq_err
1    1 unbiased_mean     NA    7.135998e-05
2    2 unbiased_mean     NA    5.676681e-03
3    3 unbiased_mean     NA    6.027509e-04
4    1    huber_mean    0.5    9.553596e-05
5    2    huber_mean    0.5    3.737304e-03
6    3    huber_mean    0.5    6.701698e-04
7    1    huber_mean    1.5    8.098644e-05
8    2    huber_mean    1.5    5.693605e-03
9    3    huber_mean    1.5    3.676952e-04
10   1    huber_mean    5.0    7.135998e-05
11   2    huber_mean    5.0    5.676681e-03
12   3    huber_mean    5.0    6.027509e-04

@pcarbo
Member

pcarbo commented Jul 2, 2018

@gaow I don't have any fundamental issues with your proposal, but I do worry that it might be difficult to implement your rule in general, or explain it. Here is a simpler rule that I think is more intuitive, easier to implement, and easier to explain:

When target is x.y, where x is a module or module group, and y is an output or module parameter, return:

(1) An error when y is not defined in x, or in any modules in x, or any modules in the same group as x;

(2) NA when module x is run, or a module in group x is run, and y is not defined in the module;

(3) NA when module x is not run.

The output will not always be as attractive or concise as it could be, but from the user's point of view it is nice when all the targets you requested appear in the output table (unless it generates an error).
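Rule (1), the only error case, could be sketched roughly as follows (a hypothetical illustration in Python with invented bookkeeping structures; rules (2) and (3) would then simply emit NA at extraction time):

```python
def validate_target(target, module_params, groups):
    """Raise if target 'x.y' names an output or parameter y that is not
    defined in module x, in any module of group x, or in any module
    grouped together with x (rule 1 above)."""
    x, y = target.split(".")
    # Modules to search: x itself, the members of group x, and any
    # modules that share a group with x.
    candidates = set(groups.get(x, [x]))
    for members in groups.values():
        if x in members:
            candidates.update(members)
    if not any(y in module_params.get(m, ()) for m in candidates):
        raise ValueError("target %r: %r is not defined in %r or any "
                         "related module" % (target, y, x))

# Example: 'mean' is a group over two modules; only huber_mean has 'k'.
groups = {"mean": ["unbiased_mean", "huber_mean"]}
module_params = {"unbiased_mean": {"est_mean"},
                 "huber_mean": {"est_mean", "k"}}
validate_target("mean.k", module_params, groups)     # ok: k is in the group
validate_target("unbiased_mean.k", module_params, groups)  # ok: NA, not error
```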

@gaow
Member Author

gaow commented Jan 31, 2019

Wouldn't it be possible for DSC to distinguish between these two possibilities? There is a difference between "there is no output file" and "there is an output file and it contains the value NA".

In principle, yes, but in practice we do not get to filter the table until we have the complete table. We would thus have to invent a custom convention in the R code to distinguish these two types of NA. See here and here. But what do we use for the other type of NA? Some crazy string of characters?

It could be added as an argument, e.g. return_all = TRUE would return the outer join and return_all = FALSE would return only those rows with no missing data

It is a compromise, but I always hesitate to add more options like this, simply because we have not agreed on the expected behavior ourselves. I am leaning towards filtering, even though it involves more work than it appears. I just want to make sure there are no bad side effects.

@jdblischak
Member

If it's unclear what the best way forward is, then maybe we should just wait until we have more user feedback. I am satisfied to use na.omit() for now.

@pcarbo
Member

pcarbo commented Jan 31, 2019

@jdblischak @gaow This is one of those situations where we didn't properly anticipate all the use cases when designing the internal data structures.

@stephens999
Contributor

stephens999 commented Feb 1, 2019 via email

@gaow
Member Author

gaow commented Feb 1, 2019

Is there still a conditions parameter that can be used to filter out pipelines?

Yes, and I think a carefully crafted conditions parameter can give clean results (although possibly not for this example, because the "outer join" happens after the condition filter). But more often than not we tend to use dscquery without condition constraints and then use R to post-process the result, so that we only query once. The condition query is perhaps only useful for very large benchmarks, where it is too much computation to load every file generated for all conditions.

I don't understand the example.

The pipelines are:

  run:
    pipe_de: de * analyze * (auc, true_fdr)
    pipe_zeros: zeros * analyze * type_one_error

The query targets are de, analyze, true_fdr. You see that de and true_fdr are only relevant to the first pipeline, while analyze is relevant to both. As explained earlier, there is no concept of a pipeline in queries. As a result, lines 36 to 45 are in fact analyze from the second pipeline, where de and true_fdr are missing.

@stephens999
Contributor

stephens999 commented Feb 1, 2019 via email

@gaow
Member Author

gaow commented Feb 1, 2019

Sorry, in the ticket I truncated the output to highlight the problem. The complete output is here:

> df_auc <- dscquery("soneson2013/",
                  targets = c("de",
                              "analyze",
                              "auc.auc"))
Reading DSC outputs:
 - auc.auc: extracted atomic values
   DSC               de analyze           auc.auc
1    1    b_1250_0_data  nbpseq 0.703066420259366
2    1    b_1250_0_data    voom 0.749941929730694
3    1    b_1250_0_data     vst 0.745499375780275
4    1    b_1250_0_data   edger 0.707332049224184
5    1    b_1250_0_data   deseq 0.627637988228999
6    1   b_625_625_data  nbpseq 0.712647685269889
7    1   b_625_625_data    voom  0.77268378180479
8    1   b_625_625_data     vst  0.77547807031507
9    1   b_625_625_data   edger 0.718635069598904
10   1   b_625_625_data   deseq 0.630987136255991
11   1    b_4000_0_data  nbpseq 0.692115897167516
12   1    b_4000_0_data    voom 0.682802801166534
13   1    b_4000_0_data     vst 0.666099988306731
14   1    b_4000_0_data   edger 0.693014518555751
15   1    b_4000_0_data   deseq 0.650152209209948
16   1 b_2000_2000_data  nbpseq 0.778166273051953
17   1 b_2000_2000_data    voom 0.772014492946875
18   1 b_2000_2000_data     vst 0.775365827640311
19   1 b_2000_2000_data   edger 0.781900276909811
20   1 b_2000_2000_data   deseq 0.744707881632538
21   1   p_625_625_data  nbpseq 0.811233462789303
22   1   p_625_625_data    voom  0.86839306341039
23   1   p_625_625_data     vst 0.870691258756994
24   1   p_625_625_data   edger 0.818146471021465
25   1   p_625_625_data   deseq 0.757796341260436
26   1   s_625_625_data  nbpseq 0.677014888307802
27   1   s_625_625_data    voom 0.749693650385287
28   1   s_625_625_data     vst 0.751365541307759
29   1   s_625_625_data   edger 0.690230491068573
30   1   s_625_625_data   deseq 0.515177624697379
31   1   r_625_625_data  nbpseq 0.651902240517095
32   1   r_625_625_data    voom 0.719384218990249
33   1   r_625_625_data     vst 0.727683658476215
34   1   r_625_625_data   edger 0.666313853728294
35   1   r_625_625_data   deseq  0.50147431424293
36   1             <NA>   edger              <NA>
37   1             <NA>   deseq              <NA>
38   1             <NA>  nbpseq              <NA>
39   1             <NA>    voom              <NA>
40   1             <NA>     vst              <NA>
41   1             <NA>   edger              <NA>
42   1             <NA>   deseq              <NA>
43   1             <NA>  nbpseq              <NA>
44   1             <NA>    voom              <NA>
45   1             <NA>     vst              <NA>
46   1             <NA>   edger              <NA>
47   1             <NA>   deseq              <NA>
48   1             <NA>  nbpseq              <NA>
49   1             <NA>    voom              <NA>
50   1             <NA>     vst              <NA>
51   1             <NA>   edger              <NA>
52   1             <NA>   deseq              <NA>
53   1             <NA>  nbpseq              <NA>
54   1             <NA>    voom              <NA>
55   1             <NA>     vst              <NA>

Here is the DSC file. Two pipelines are executed; lines before 35 are output for pipeline_de.

Can you point me to the docs that describe the query behavior?

The "official" approach is to use dscquery in R, see docs here. It is a wrapper of this program. However I dont think these documentation is very helpful for this specific issue.

@stephens999
Contributor

stephens999 commented Feb 2, 2019

thanks @gaow

you say two pipelines are executed, although of course there are many more than that...
(Here I am using the terminology that a pipeline is a sequence of modules)

In any case I have a question.

In the results I see there are 35 pipelines created by (de * analyze * auc) and 20 by
zeros * analyze * type_one_error

Where are the results for the other 35 pipelines created by
(de * analyze * true_fdr)
?

@gaow
Member Author

gaow commented Feb 5, 2019

Where are the results for the other 35 pipelines created by
(de * analyze * true_fdr)

They are not available because they are not part of the query. When I query this:

library(dscrutils)
df_auc <- dscquery("soneson2013/",
                   targets = c("de",
                               "analyze",
                               "auc.auc",
                               "true_fdr.true_fdr"))

I get:

DSC de de.output.file analyze analyze.output.file auc.auc true_fdr.true_fdr
1 b_1250_0_data b_1250_0_data/b_1250_0_data_1 edger edger/b_1250_0_data_4_edger_2 5 NA
1 b_1250_0_data b_1250_0_data/b_1250_0_data_2 edger edger/b_1250_0_data_5_edger_2 5 NA
1 b_1250_0_data b_1250_0_data/b_1250_0_data_3 edger edger/b_1250_0_data_6_edger_2 5 NA
1 b_1250_0_data b_1250_0_data/b_1250_0_data_1 deseq deseq/b_1250_0_data_4_deseq_2 5 NA
1 b_1250_0_data b_1250_0_data/b_1250_0_data_2 deseq deseq/b_1250_0_data_5_deseq_2 5 NA
1 b_1250_0_data b_1250_0_data/b_1250_0_data_3 deseq deseq/b_1250_0_data_6_deseq_2 5 NA
1 b_1250_0_data b_1250_0_data/b_1250_0_data_1 nbpseq nbpseq/b_1250_0_data_4_nbpseq_2 5 NA
1 b_1250_0_data b_1250_0_data/b_1250_0_data_2 nbpseq nbpseq/b_1250_0_data_5_nbpseq_2 5 NA
1 b_1250_0_data b_1250_0_data/b_1250_0_data_3 nbpseq nbpseq/b_1250_0_data_6_nbpseq_2 5 NA
1 b_1250_0_data b_1250_0_data/b_1250_0_data_1 voom voom/b_1250_0_data_4_voom_2 5 NA
1 b_1250_0_data b_1250_0_data/b_1250_0_data_2 voom voom/b_1250_0_data_5_voom_2 5 NA
1 b_1250_0_data b_1250_0_data/b_1250_0_data_3 voom voom/b_1250_0_data_6_voom_2 5 NA
1 b_1250_0_data b_1250_0_data/b_1250_0_data_1 vst vst/b_1250_0_data_4_vst_2 5 NA
1 b_1250_0_data b_1250_0_data/b_1250_0_data_2 vst vst/b_1250_0_data_5_vst_2 5 NA
1 b_1250_0_data b_1250_0_data/b_1250_0_data_3 vst vst/b_1250_0_data_6_vst_2 5 NA
1 b_625_625_data b_625_625_data/b_625_625_data_1 edger edger/b_625_625_data_4_edger_2 5 NA
1 b_625_625_data b_625_625_data/b_625_625_data_2 edger edger/b_625_625_data_5_edger_2 5 NA
1 b_625_625_data b_625_625_data/b_625_625_data_3 edger edger/b_625_625_data_6_edger_2 5 NA
1 b_625_625_data b_625_625_data/b_625_625_data_1 deseq deseq/b_625_625_data_4_deseq_2 5 NA
1 b_625_625_data b_625_625_data/b_625_625_data_2 deseq deseq/b_625_625_data_5_deseq_2 5 NA
1 b_625_625_data b_625_625_data/b_625_625_data_3 deseq deseq/b_625_625_data_6_deseq_2 5 NA
1 b_625_625_data b_625_625_data/b_625_625_data_1 nbpseq nbpseq/b_625_625_data_4_nbpseq_2 5 NA
1 b_625_625_data b_625_625_data/b_625_625_data_2 nbpseq nbpseq/b_625_625_data_5_nbpseq_2 5 NA
1 b_625_625_data b_625_625_data/b_625_625_data_3 nbpseq nbpseq/b_625_625_data_6_nbpseq_2 5 NA
1 b_625_625_data b_625_625_data/b_625_625_data_1 voom voom/b_625_625_data_4_voom_2 5 NA
1 b_625_625_data b_625_625_data/b_625_625_data_2 voom voom/b_625_625_data_5_voom_2 5 NA
1 b_625_625_data b_625_625_data/b_625_625_data_3 voom voom/b_625_625_data_6_voom_2 5 NA
1 b_625_625_data b_625_625_data/b_625_625_data_1 vst vst/b_625_625_data_4_vst_2 5 NA
1 b_625_625_data b_625_625_data/b_625_625_data_2 vst vst/b_625_625_data_5_vst_2 5 NA
1 b_625_625_data b_625_625_data/b_625_625_data_3 vst vst/b_625_625_data_6_vst_2 5 NA
1 b_4000_0_data b_4000_0_data/b_4000_0_data_1 edger edger/b_4000_0_data_4_edger_2 NA 0
1 b_4000_0_data b_4000_0_data/b_4000_0_data_2 edger edger/b_4000_0_data_5_edger_2 NA 0
1 b_4000_0_data b_4000_0_data/b_4000_0_data_3 edger edger/b_4000_0_data_6_edger_2 NA 0
1 b_4000_0_data b_4000_0_data/b_4000_0_data_1 deseq deseq/b_4000_0_data_4_deseq_2 NA 0
1 b_4000_0_data b_4000_0_data/b_4000_0_data_2 deseq deseq/b_4000_0_data_5_deseq_2 NA 0
1 b_4000_0_data b_4000_0_data/b_4000_0_data_3 deseq deseq/b_4000_0_data_6_deseq_2 NA 0
1 b_4000_0_data b_4000_0_data/b_4000_0_data_1 nbpseq nbpseq/b_4000_0_data_4_nbpseq_2 NA 0
1 b_4000_0_data b_4000_0_data/b_4000_0_data_2 nbpseq nbpseq/b_4000_0_data_5_nbpseq_2 NA 0
1 b_4000_0_data b_4000_0_data/b_4000_0_data_3 nbpseq nbpseq/b_4000_0_data_6_nbpseq_2 NA 0
1 b_4000_0_data b_4000_0_data/b_4000_0_data_1 voom voom/b_4000_0_data_4_voom_2 NA 0
1 b_4000_0_data b_4000_0_data/b_4000_0_data_2 voom voom/b_4000_0_data_5_voom_2 NA 0
1 b_4000_0_data b_4000_0_data/b_4000_0_data_3 voom voom/b_4000_0_data_6_voom_2 NA 0
1 b_4000_0_data b_4000_0_data/b_4000_0_data_1 vst vst/b_4000_0_data_4_vst_2 NA 0
1 b_4000_0_data b_4000_0_data/b_4000_0_data_2 vst vst/b_4000_0_data_5_vst_2 NA 0
1 b_4000_0_data b_4000_0_data/b_4000_0_data_3 vst vst/b_4000_0_data_6_vst_2 NA 0
1 b_2000_2000_data b_2000_2000_data/b_2000_2000_data_1 edger edger/b_2000_2000_data_4_edger_2 NA 0
1 b_2000_2000_data b_2000_2000_data/b_2000_2000_data_2 edger edger/b_2000_2000_data_5_edger_2 NA 0
1 b_2000_2000_data b_2000_2000_data/b_2000_2000_data_3 edger edger/b_2000_2000_data_6_edger_2 NA 0
1 b_2000_2000_data b_2000_2000_data/b_2000_2000_data_1 deseq deseq/b_2000_2000_data_4_deseq_2 NA 0
1 b_2000_2000_data b_2000_2000_data/b_2000_2000_data_2 deseq deseq/b_2000_2000_data_5_deseq_2 NA 0
1 b_2000_2000_data b_2000_2000_data/b_2000_2000_data_3 deseq deseq/b_2000_2000_data_6_deseq_2 NA 0
1 b_2000_2000_data b_2000_2000_data/b_2000_2000_data_1 nbpseq nbpseq/b_2000_2000_data_4_nbpseq_2 NA 0
1 b_2000_2000_data b_2000_2000_data/b_2000_2000_data_2 nbpseq nbpseq/b_2000_2000_data_5_nbpseq_2 NA 0
1 b_2000_2000_data b_2000_2000_data/b_2000_2000_data_3 nbpseq nbpseq/b_2000_2000_data_6_nbpseq_2 NA 0
1 b_2000_2000_data b_2000_2000_data/b_2000_2000_data_1 voom voom/b_2000_2000_data_4_voom_2 NA 0
1 b_2000_2000_data b_2000_2000_data/b_2000_2000_data_2 voom voom/b_2000_2000_data_5_voom_2 NA 0
1 b_2000_2000_data b_2000_2000_data/b_2000_2000_data_3 voom voom/b_2000_2000_data_6_voom_2 NA 0
1 b_2000_2000_data b_2000_2000_data/b_2000_2000_data_1 vst vst/b_2000_2000_data_4_vst_2 NA 0
1 b_2000_2000_data b_2000_2000_data/b_2000_2000_data_2 vst vst/b_2000_2000_data_5_vst_2 NA 0
1 b_2000_2000_data b_2000_2000_data/b_2000_2000_data_3 vst vst/b_2000_2000_data_6_vst_2 NA 0

We can discuss it when we meet but here is a minimal working example:

mwe.dsc.txt

To run this example, first update DSC:

pip install dsc -U

then run it via:

dsc mwe.dsc.txt 

It should work without having to install any packages. The results can be queried using the R code at the beginning of this post.

@gaow
Member Author

gaow commented Feb 8, 2019

@stephens999 to answer your question again, "Where are the results for the other 35 pipelines created by
(de * analyze * true_fdr)", let me clarify what your concern was as you explained to me in our in-person discussion -- because if each row is a pipeline instance, these pipelines should also show up in the result as:

DSC de de.output.file analyze analyze.output.file auc.auc
1 b_4000_0_data b_4000_0_data/b_4000_0_data_1 edger edger/b_4000_0_data_1_edger_1 NA
1 b_4000_0_data b_4000_0_data/b_4000_0_data_2 edger edger/b_4000_0_data_2_edger_1 NA
1 b_4000_0_data b_4000_0_data/b_4000_0_data_3 edger edger/b_4000_0_data_3_edger_1 NA

where the last column is NA because the true_fdr score pipeline does not have the information.

I agree one would expect it. In fact you will see it if I modify this line:

pipelines = filter_sublist([get_sequence(tables, pipeline) for tables, pipeline in zip(valid_tables, self.pipelines)])

by simply removing the call to filter_sublist. What that function does is remove pipeline instances that are contained in other pipeline instances, after accounting for query targets. For example, the query:

de analyze auc.auc

will completely overlap with pipeline de * analyze * auc and result in:

1 | b_4000_0_data  | edger  | 0.8

but in pipeline de * analyze * true_fdr it will only overlap de * analyze and result in:

1 | b_4000_0_data  | edger  | NA

Since the first pipeline instance 1 | b_4000_0_data | edger | 0.8 completely contains the second, 1 | b_4000_0_data | edger | NA, the second is filtered out.

However, the story with the other pipeline in the DSC, zeros * analyze * type_one_error, is different. Since its only overlap with de analyze auc.auc is the analyze module, the overlap is not a complete pipeline instance. A pipeline instance should start with a module that has no upstream dependency. As a result, we have to complete the analyze module's upstream, which includes the zeros module group as one possibility. However, since no output has been requested from the zeros module group, the output will look like:

1 | NA  | edger  | NA

In sum, the behavior of dsc-query is as intended, at least in this example. Although that filter_sublist call can easily be removed to make each row truly one pipeline instance, I think it is neater to keep it as is, particularly since we are going to implement the new condition and filtering approach (discussed on Slack); by that time this will not matter to the end user.
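To make the containment idea concrete, here is a minimal Python sketch of what a filter like this could do. This is an illustrative reconstruction, not the actual DSC implementation of filter_sublist; the tuple representation of rows and the use of None for missing values are assumptions.

```python
# Illustrative sketch of the containment filter described above; NOT the
# actual DSC implementation of filter_sublist. Rows are tuples of query
# fields, with None standing in for a missing (NA) value.

def filter_sublist(rows):
    """Drop any row whose non-missing fields are fully contained in another row."""
    def contained(a, b):
        # a is contained in b if every non-missing field of a matches b
        return a != b and all(x is None or x == y for x, y in zip(a, b))
    return [r for r in rows if not any(contained(r, other) for other in rows)]

rows = [
    ("b_4000_0_data", "edger", 0.8),   # from pipeline de * analyze * auc
    ("b_4000_0_data", "edger", None),  # from pipeline de * analyze * true_fdr
]
print(filter_sublist(rows))  # only the complete row survives
```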

@gaow
Member Author

gaow commented Feb 8, 2019

@jdblischak With this pull request you can change your query to:

library(dscrutils)
df_auc <- dscquery("soneson2013/",
                   targets = "auc.auc",
                   others = c("de", "analyze"),
                   omit.file.columns = TRUE)

and get what you actually wanted. Notice you can now use a new option omit.file.columns = TRUE to remove columns that contain only file names -- one would not need the filenames of module instances unless they are meant to be loaded manually. I have merged this into master.

@stephens999 I used the name targets for what you suggested as target_columns, and others for what you suggested as others_column. Also, I allowed for multiple targets columns with OR logic -- that is, a row will only be removed if all values in its targets columns are missing. This implementation is thus backwards compatible -- the behavior of existing dscquery code that other users have will not change. Here is the R documentation:

 targets: Query targets specified as character string separated by
          space, or a character vector, e.g., ‘targets = "simulate.n
          analyze score.error"’ and ‘targets =
          c("simulate.n","analyze","score.error")’ are equivalent.
          Using ‘paste’, eg ‘paste("simulate",c("n","p","df"),sep=".")’
          one can specify multiple properties from the same module.
          These will be the names of the columns in the returned data
          frame.

  others: Additional query items similarly specified as ‘targets’.
          The difference between ‘targets’ and ‘others’ is that rows
          whose ‘targets’ columns contain all missing values will be
          removed, while ‘others’ columns have no such effect.
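The OR behavior described in the documentation can be contrasted with an AND alternative in a small sketch. The helper names are hypothetical and not part of dscrutils; rows are dicts with None standing in for missing values.

```python
# Hypothetical sketch contrasting the OR logic used here with an AND
# alternative; helper names are illustrative, not part of dscrutils.

def keep_or(row, target_cols):
    # OR logic: keep the row if at least one targets column is non-missing
    return any(row[c] is not None for c in target_cols)

def keep_and(row, target_cols):
    # AND logic: keep the row only if every targets column is non-missing
    return all(row[c] is not None for c in target_cols)

row = {"auc.auc": 0.8, "true_fdr.true_fdr": None}
print(keep_or(row, ["auc.auc", "true_fdr.true_fdr"]))   # True
print(keep_and(row, ["auc.auc", "true_fdr.true_fdr"]))  # False
```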

@gaow
Member Author

gaow commented Feb 10, 2019

I close this ticket now since the current behavior is what we've discussed. I'm still open to interface changes -- we can use separate tickets for it.

@gaow gaow closed this as completed Feb 10, 2019
@pcarbo
Member

pcarbo commented Feb 11, 2019

@gaow I'm finally catching up on DSC. From help(dscquery):

The difference between ‘targets’ and ‘others’ is that rows whose ‘targets’ columns contain all missing values will be removed, while ‘others’ columns have no such effect.

First, this description should be given in the "targets" argument.

Second, this behaviour doesn't seem like quite what we discussed; from my understanding, if any of the targets have a missing value, then the row should be removed or, equivalently, the row is only kept if all targets have non-missing values.

Third, can we change the names of "targets" and "others" arguments? In particular, "others" is not a great name for an argument. What about "target.cols" and "other.cols", for example?

@gaow
Member Author

gaow commented Feb 11, 2019

First, this description should be given in the "target" argument.

Sure, please feel free to make changes.

Second, this behaviour doesn't seem quite like what we discussed; from my understanding, if any of the targets have a missing value, then the row should be removed or, equivalently, the row is only kept if all targets have non-missing values.

We have not discussed what to do with multiple targets. @stephens999 initially suggested limiting it to only one column. I used OR here because it will make it compatible with existing code. We can certainly change it to AND logic easily, if we all agree with it (and are not bothered by backwards compatibility).

Third, can we change the names of "targets" and "others" arguments?

Again, there is a backwards-compatibility issue, which might impact me, Jean, Kaiqian, Yuxin and possibly John -- perhaps mostly me. I'm not against changing it, particularly at this point; I'm only saying that we should be careful if we want to make interface changes.

@pcarbo
Member

pcarbo commented Feb 11, 2019

I used OR here because it will make it compatible with existing code.

Sounds like we weren't precise enough in our conversation. I'm open to the "or" behaviour, but my understanding from our discussion is that we want the "and" behaviour. Let's see what @stephens999 says.

I think if we name the arguments in a clever way, we can remain backward compatible.

@stephens999 Can we reopen this so that we remember to check in with Matthew about this?

@gaow gaow reopened this Feb 11, 2019
@stephens999
Contributor

stephens999 commented Feb 12, 2019 via email

@pcarbo
Member

pcarbo commented Feb 12, 2019

So I suggested to only allow one to avoid having to make a decision until we have a use case...

I'm okay with that. But to ensure backward compatibility, what we would have to do is add a new argument to dscquery that is different from what we had before; if we repurpose "targets", then we will lose backward compatibility.

Here is an idea that could be more user-friendly: many functions in R that operate on vectors or matrices have to deal with NAs as special cases, and there is often an argument to specify the behaviour for missing values; e.g., na.rm in the function mean. Maybe we could have an argument na.rm that accepts one (or more) of the names specified in targets. I think this interface would be more familiar to many R users.
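To make the proposal concrete, here is a minimal sketch of how an na.rm-style argument could behave. The function name, signature, and dict-based row representation are all hypothetical, not actual dscquery code.

```python
# Minimal sketch of the proposed na.rm-style argument; the function name,
# signature, and dict-based rows are hypothetical, not actual dscquery code.

def filter_na(rows, na_rm):
    """Drop rows with a missing value in any column named in na_rm."""
    return [r for r in rows if all(r.get(c) is not None for c in na_rm)]

rows = [
    {"de": "b_4000_0_data", "auc.auc": 0.8},
    {"de": "b_4000_0_data", "auc.auc": None},
]
print(filter_na(rows, na_rm=["auc.auc"]))  # only the row with a non-missing auc
```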

@gaow
Member Author

gaow commented Feb 12, 2019

Maybe we could have an argument na.rm that accepts the name of one (or more) names specified in targets

I like this proposal! Are there any obvious limitations?

@stephens999
Contributor

stephens999 commented Feb 12, 2019 via email

@gaow
Member Author

gaow commented Feb 12, 2019

It kind of goes back to the difference between na and not run. ... intended behavior was to remove pipelines that did not run auc (or whatever is in Target) rather than removing pipeline that run it but produce na?

This is important to distinguish. Currently the intended behavior is as you suggested above. It can still be the case when we implement the na.rm parameter, but the interpretation of na.rm will be "pipelines not run", not the conventional na.rm in R. And na.rm in our case will take column names anyway. Of course we can use other names.

So I suggested to only allow one to avoid having to make a decision until we have a use case...

@stephens999 this is currently already the case when targets has only one element. We just do not ban the case with more than one element now.

@pcarbo
Member

pcarbo commented Feb 13, 2019

I like this proposal! Are there any obvious limitations?

@gaow None that I see.

I thought the intended behavior was to remove pipelines that did not run auc (or whatever is in Target) rather than removing pipeline that run it but produce na?

I thought we recognized that it is currently not possible to distinguish between the two based on how the DSC results are stored. My understanding is that sometimes DSC assigns NA to auc when the module generating a value for auc is not run.

@stephens999
Contributor

stephens999 commented Feb 13, 2019 via email

@gaow
Member Author

gaow commented Feb 13, 2019

Yes, @stephens999 is correct. This is exactly what happened -- thanks to #167, all filtering is now in R and we can filter in as many rounds as we want. We now filter twice: before extracting results, and after extracting values. The "not run" filter only applies to the first round.

@pcarbo
Member

pcarbo commented Feb 13, 2019

We now filter twice: before extracting results, and after extracting values.

@gaow And only the conditions are applied to filter results in the second round?

@gaow
Member Author

gaow commented Feb 13, 2019

And only the conditions are applied to filter results in the second round?

Columns involved in conditions are parsed, and the program decides which ones to filter on in the first round and which in the second. In the first round we filter by module parameters as well as by targets not run; in the second round we filter by module output, because at that point the outputs have already been extracted and are available.
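The two-round split can be sketched as follows. The names and the tuple representation of conditions are hypothetical; the real implementation parses condition strings, which is not shown here.

```python
# Sketch of the two-round split described above; names and the condition
# representation are hypothetical, not the actual dscrutils internals.

def split_conditions(conditions, param_cols):
    """Separate conditions on module parameters (round 1, applied before
    extracting outputs) from conditions on module outputs (round 2)."""
    round1 = [c for c in conditions if c[0] in param_cols]
    round2 = [c for c in conditions if c[0] not in param_cols]
    return round1, round2

conditions = [("simulate.n", ">", 100), ("score.error", "<", 0.05)]
r1, r2 = split_conditions(conditions, param_cols={"simulate.n"})
print(r1)  # filtered before output extraction
print(r2)  # filtered after output extraction
```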

@gaow gaow closed this as completed Nov 21, 2019