dscquery takes long time to load data #203

Open
gaow opened this issue Nov 21, 2019 · 8 comments
@gaow
Member

gaow commented Nov 21, 2019

@fmorgante has reported low performance of dscquery at the scale of the DSC he's working on. @fmorgante, it would be helpful if you could tell us:

  1. A link to your DSC (ideally a specific commit on GitHub)
  2. The query you ran
  3. The exact time it took to get the output table (note to self: we should add code to report elapsed time)
  4. The dimensions of your output table
  5. A list of the column names in your output table (needed to determine how many datasets were loaded)

Also, since we now use RDS and PKL files to save output, we have to load an entire file to extract a specific quantity. This is a limitation we cannot resolve unless we switch to another data storage solution, as has long been discussed.
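For instance, with RDS files the whole object must be deserialized even to pull out one field (a minimal sketch; the file name and field are hypothetical):

# readRDS must deserialize the entire file before we can index into it:
out <- readRDS("simulate/simulate_1.rds")  # loads every stored object
err <- out$error                           # even though we only want this one field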

@pcarbo
Member

pcarbo commented Nov 21, 2019

@gaow As we discussed in person, I think the best way to approach this is to provide more information to the user about what dscquery is doing, and its progress. The interface should also provide better guidance to the user about how to use dscquery effectively.

gaow added a commit that referenced this issue Nov 21, 2019
@gaow
Member Author

gaow commented Nov 21, 2019

@pcarbo I implemented a simple progress bar that shows the percentage of tasks completed and the estimated time remaining:

> dscout <- dscquery(dsc.outdir = "dsc_result",
+                    targets    = c("simulate","analyze","score.error"))

dsc-query dsc_result -o /tmp/Rtmp4CNpsR/file63a3384d1981.csv --target "simulate analyze score.error" --force
INFO: Loading database ...
INFO: Running queries ...
INFO: Extraction complete!
Populating DSC output table of dimension 8 by 7.
- Loading targets [==========================] 100% eta:  0s

For internal diagnosis, it might be worth adding monitoring stats such as CPU usage and disk I/O status, to see whether there are other improvements we can make.
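For reference, here is a minimal sketch of how such a bar can be wired up with the progress R package; this is an illustration, not necessarily the exact code in dscrutils:

library(progress)
n <- 8  # number of targets to load
pb <- progress_bar$new(format = "- Loading targets [:bar] :percent eta: :eta",
                       total = n, clear = FALSE)
for (i in seq_len(n)) {
  # ... load target i from its RDS/PKL file here ...
  pb$tick()  # advance the bar and update the eta estimate
}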

@pcarbo
Member

pcarbo commented Nov 22, 2019

@gaow Very nice! That is certainly an improvement.

@gaow
Member Author

gaow commented Nov 22, 2019

I'm testing out @fmorgante's example myself. I noticed that even running a regular query via the dsc-query command takes a very long time to complete -- roughly 5 minutes before it can even start loading the output. The resulting matrix to fill is 720,000 x 12, i.e. fewer than a million module instances, which is a reasonable scale for a benchmark. It seems we need to work out the performance issues here, both for the Python program dsc-query and for the subsequent loading step in R.

@pcarbo
Member

pcarbo commented Nov 22, 2019

@gaow One thing that might be helpful here would be to establish a "lower bound" on runtime. For instance, suppose I load matrices from hundreds of .rds files into R, and combine them in a "smart" way. How long does this take? How does this compare to doing this in dscquery? It shouldn't be hard to come up with a simple comparison.
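Something along these lines, say (a rough sketch; the directory, the file pattern, and the assumption that each file stores one matrix with compatible columns are all made up for illustration):

# Lower bound: how long does it take just to read and combine the files?
files <- list.files("dsc_result", pattern = "\\.rds$",
                    recursive = TRUE, full.names = TRUE)
timing <- system.time({
  dat <- do.call(rbind, lapply(files, readRDS))
})
print(timing)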

@gaow
Member Author

gaow commented Nov 22, 2019

The Python program dsc-query does more than just lump together those RDS files. It took 5 minutes for 720,000 rows, which can be improved but is not as bad as the loading step, which ran for more than 3 hours before it got killed (as our users ran out of patience). I'm using @fmorgante's example and the progress bar to identify which code chunk is the culprit.

@gaow
Member Author

gaow commented Nov 22, 2019

One thing I noticed is that lines 571 to 582 of the current dscquery.R do something even before loading any data; that is, it hangs there before hitting the progress bar. If you download this dataset, test.csv.gz (generated by the dsc-query command that took 5 minutes), decompress it to test.csv, and run this query:

dsc <- dscrutils::dscquery("./", c("score.err", "score", "simulate", "fit_pred",
                                   "simulate.n_traits", "simulate.pve",
                                   "simulate.n_signal", "small_data.subsetN"),
                           return.type = "list", verbose = TRUE, cache = "test.csv")

Since I added the cache argument in my latest commit, this should bring you right to the lines in question in that function, without needing @fmorgante's DSC outputs in place. You'll see it get stuck under the two-layer for loop before hitting the progress bar.

By "stuck" I'm talking about 2 hours as of now, and still counting!

Improving this could be a good first issue for someone with some computational background. Still, it would be nice if you could verify it. I suspect it might be easier (at least for me) to deal with this at the level of the dsc-query Python program, but I'm sure better R code could also help.
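To illustrate the kind of pattern that can blow up like this, here is a hypothetical example (not the actual lines 571 to 582; dat stands in for some matrix of results): growing a data frame inside nested loops copies everything accumulated so far on every iteration, giving quadratic cost, whereas building the result in one shot is linear.

# Slow: each rbind copies the whole accumulated data frame.
out <- NULL
for (i in seq_len(nrow(dat))) {
  for (j in seq_len(ncol(dat))) {
    out <- rbind(out, data.frame(row = i, col = j, value = dat[i, j]))
  }
}

# Fast: preallocate and fill in one step.
out <- data.frame(row   = rep(seq_len(nrow(dat)), each  = ncol(dat)),
                  col   = rep(seq_len(ncol(dat)), times = nrow(dat)),
                  value = as.vector(t(dat)))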

@pcarbo
Member

pcarbo commented Dec 3, 2019

@gaow One bottleneck was read.csv; the performance instantly improved when I replaced it with fread from the data.table package.
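A minimal sketch of the swap (using the test.csv file from the example above):

# Before: base R reader, slow on a ~720,000-row file.
dat <- read.csv("test.csv", stringsAsFactors = FALSE)

# After: data.table's fread parses the same file much faster.
library(data.table)
dat <- fread("test.csv", data.table = FALSE)  # return a plain data frame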

There are some other places where the code is unnecessarily slow due to a naive implementation. I will continue to work on this.

In any case, this is a very useful test case. (And it is the first time I've tried running dscquery on a table with 700,000 rows.)
