
Memory limitations: option to store simulations on disk #255

Open · wants to merge 1 commit into master
Conversation

dschoenig

Addresses #188

@dschoenig dschoenig changed the title Memory limitations: store simulations on disk Memory limitations: option to store simulations on disk Feb 26, 2021
@dschoenig (Author)

For testing, this repurposes the example in the vignette to produce a simulation matrix with 1 billion elements.

testData <- createData(sampleSize = 5e5, overdispersion = 1.5, family = poisson())
fittedModel <- glm(observedResponse ~ Environment1, family = "poisson", data = testData)
res <- simulateResiduals(fittedModel, 2000, method = "PIT", bigData = TRUE)
plot(res)

Computing the residuals took ~260s for me on a laptop with SSD.

@florianhartig (Owner)

Hi Daniel,

I'm just checking open issues and wanted to say first of all: many thanks, I appreciate this PR. I had of course seen it, but didn't reply yet because I hadn't thought about it in more detail.

Making DHARMa fit for big data is certainly useful; I'm just not quite sure that doing this via storage is the way to go. An alternative may simply be to force grouped residuals, i.e. implement the option from recalculateResiduals directly in simulateResiduals.

The reason is that I'm not quite sure how helpful it will be to have 1 billion residuals if all the tests / plots crash downstream.

Any thoughts on that?

Cheers,
Florian

@dschoenig (Author)

Hi Florian,

No worries, I needed the ff implementation for a project so it wasn't too much work to put some of this into DHARMa.

I completely agree with you that there's no point in having a large number of residuals if most of the functionality of the package will be compromised.

I personally think there may be two viable alternatives, aggregation (as with grouped residuals) or sub-sampling. I think both would be possible with the functionality already provided by recalculateResiduals (group and sel arguments, respectively), is that correct?
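To make the two routes concrete, here is a minimal sketch using the group and sel arguments of recalculateResiduals mentioned above; the grouping variable, bin count, and subset size are illustrative, and the call pattern has not been tested against this PR:

```r
library(DHARMa)

# Sketch only: both routes reuse recalculateResiduals() on an existing
# DHARMa object `res` (e.g. from the example above).

# Aggregation: average residuals within groups of observations
grouping <- cut(testData$Environment1, breaks = 1000)
resAgg <- recalculateResiduals(res, group = grouping)

# Sub-sampling: restrict to a random subset of observations
resSub <- recalculateResiduals(res, sel = sample(nrow(testData), 10000))
```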

Both would probably require setting (internally) some kind of permitted maximum number of residuals that would still allow using all / most of the tests, etc. that the package provides.

When using aggregation, it would then probably come down to finding a sensible default grouping. I think the model predictions could be ranked, and based on their rank they could be assigned to equal-size bins, the number of bins being the maximum number of residuals permitted.
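A minimal base-R sketch of that ranking-and-binning idea (binByRank is a hypothetical helper, not part of DHARMa):

```r
# Assign each observation to one of nBins equal-size bins by the rank of
# its model prediction; ties.method = "first" makes the ranks a
# permutation of 1:n, so all bins have exactly the same size whenever
# n is divisible by nBins.
binByRank <- function(predictions, nBins) {
  r <- rank(predictions, ties.method = "first")
  ceiling(r * nBins / length(predictions))
}

bins <- binByRank(runif(1e5), nBins = 500)  # 500 bins of 200 observations each
```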

For a sub-sampling approach, one could work in a similar fashion (ranking and binning), but choosing considerably fewer bins (e.g. a 20th of the permitted maximum number of residuals) and then sampling enough observations in each bin (20 if keeping with the example) to reach the target number.
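Continuing the sketch, a stratified sub-sample could be drawn per bin; the resulting index vector should be usable as the sel argument of recalculateResiduals (the helper name is illustrative):

```r
# Rank predictions into nBins equal-size bins, then sample perBin
# observations from each bin; returns row indices for sub-setting.
stratifiedSample <- function(predictions, nBins, perBin) {
  bin <- ceiling(rank(predictions, ties.method = "first") * nBins / length(predictions))
  unname(unlist(lapply(split(seq_along(predictions), bin), sample, size = perBin)))
}

sel <- stratifiedSample(runif(1e5), nBins = 500, perBin = 20)  # 10000 indices
# e.g. resSub <- recalculateResiduals(res, sel = sel)
```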

I'm also slightly in favor of aggregation, but it may be worthwhile testing both approaches.

And just some additional thoughts in this context:

  • If the final number of residuals is still too large for a smooth scatter plot, one could switch to a gray-scale heatmap / 2D-histogram for the residuals vs predicted plot.
  • For the different tests, I assume they scale quite differently with the number of residuals, and I really wouldn't consider myself an expert when it comes to their respective properties. Still, it might be interesting to explore which tests could be bootstrapped or could work with a smaller sub-sample of the residuals. I think Moran's I is a good example: calculations rapidly become prohibitively slow above 100,000 observations, but smaller sub-samples / bootstrapping seem to do the trick in many cases.
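The sub-sampled Moran's I idea could be sketched roughly as follows, assuming observation coordinates x and y and a DHARMa object res; this is an untested illustration, not part of this PR:

```r
# Run testSpatialAutocorrelation on repeated random subsets instead of
# the full data set, since Moran's I scales poorly with n.
n <- length(res$scaledResiduals)
pvals <- replicate(20, {
  s <- sample(n, 5000)
  testSpatialAutocorrelation(recalculateResiduals(res, sel = s),
                             x = x[s], y = y[s], plot = FALSE)$p.value
})
# Inspect the distribution of p-values across subsets
summary(pvals)
```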
