
Memory limitations: option to store simulations on disk #255

Open · wants to merge 1 commit into master
Conversation

dschoenig

Addresses #188

@dschoenig dschoenig changed the title Memory limitations: store simulations on disk Memory limitations: option to store simulations on disk Feb 26, 2021
@dschoenig (Author)

For testing, this repurposes the example in the vignette to produce a simulation matrix with 1 billion elements.

testData <- createData(sampleSize = 5e5, overdispersion = 1.5, family = poisson())
fittedModel <- glm(observedResponse ~ Environment1, family = "poisson", data = testData)
res <- simulateResiduals(fittedModel, 2000, method = "PIT", bigData = TRUE)
plot(res)

Computing the residuals took ~260s for me on a laptop with SSD.

@florianhartig (Owner)

Hi Daniel,

I'm just checking open issues and wanted to say first of all: many thanks, I appreciate this PR. I had of course seen it, but didn't reply yet because I hadn't thought about it in more detail.

Making DHARMa fit for big data is certainly useful; I'm just not quite sure that doing this via storage is the way to go. An alternative may simply be to force grouped residuals, i.e. implement the option from recalculateResiduals directly in simulateResiduals.

The reason is that I'm not quite sure how helpful it will be to have 1 billion residuals if all the tests / plots crash downstream.

Any thoughts on that?

Cheers,
Florian

@dschoenig (Author)

Hi Florian,

No worries, I needed the ff implementation for a project so it wasn't too much work to put some of this into DHARMa.

I completely agree with you that there's no point in having a large number of residuals if most of the functionality of the package will be compromised.

I personally think there may be two viable alternatives, aggregation (as with grouped residuals) or sub-sampling. I think both would be possible with the functionality already provided by recalculateResiduals (group and sel arguments, respectively), is that correct?
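To make the two routes concrete, here is a minimal sketch using the group and sel arguments of recalculateResiduals mentioned above; the grouping variable, bin count, and subset size are illustrative, and the call pattern has not been tested against this PR:

```r
library(DHARMa)

# Sketch only: both routes reuse recalculateResiduals() on an existing
# DHARMa object `res` (e.g. from the example above).

# Aggregation: average residuals within groups of observations
grouping <- cut(testData$Environment1, breaks = 1000)
resAgg <- recalculateResiduals(res, group = grouping)

# Sub-sampling: restrict to a random subset of observations
resSub <- recalculateResiduals(res, sel = sample(nrow(testData), 10000))
```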

Both would probably require setting (internally) some kind of permitted maximum number of residuals that would still allow using all / most of the tests, etc. that the package provides.

When using aggregation, it would then probably come down to finding a sensible default grouping. I think the model predictions could be ranked, and based on their rank they could be assigned to equal-size bins, the number of bins being the maximum number of residuals permitted.
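A minimal base-R sketch of that ranking-and-binning idea (binByRank is a hypothetical helper, not part of DHARMa):

```r
# Assign each observation to one of nBins equal-size bins by the rank of
# its model prediction; ties.method = "first" makes the ranks a
# permutation of 1:n, so all bins have exactly the same size whenever
# n is divisible by nBins.
binByRank <- function(predictions, nBins) {
  r <- rank(predictions, ties.method = "first")
  ceiling(r * nBins / length(predictions))
}

bins <- binByRank(runif(1e5), nBins = 500)  # 500 bins of 200 observations each
```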

For a sub-sampling approach, one could work in a similar fashion (ranking and binning), but choosing considerably fewer bins (e.g. a 20th of the permitted maximum number of residuals) and then sampling enough observations in each bin (20 if keeping with the example) to reach the target number.
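Continuing the sketch, a stratified sub-sample could be drawn per bin; the resulting index vector should be usable as the sel argument of recalculateResiduals (the helper name is illustrative):

```r
# Rank predictions into nBins equal-size bins, then sample perBin
# observations from each bin; returns row indices for sub-setting.
stratifiedSample <- function(predictions, nBins, perBin) {
  bin <- ceiling(rank(predictions, ties.method = "first") * nBins / length(predictions))
  unname(unlist(lapply(split(seq_along(predictions), bin), sample, size = perBin)))
}

sel <- stratifiedSample(runif(1e5), nBins = 500, perBin = 20)  # 10000 indices
# e.g. resSub <- recalculateResiduals(res, sel = sel)
```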

I'm also slightly in favor of aggregation, but it may be worthwhile testing both approaches.

And just some additional thoughts in this context:

  • If the final number of residuals is still too large for a smooth scatter plot, one could switch to a gray-scale heatmap / 2D-histogram for the residuals vs predicted plot.
  • For the different tests, I assume they scale quite differently with the number of residuals, and I really wouldn't consider myself an expert when it comes to their respective properties. Still, it might be interesting to explore which tests could be bootstrapped or could work with a smaller sub-sample of the residuals. I think Moran's I is a good example: calculations rapidly become prohibitively slow above 100,000 observations, but smaller sub-samples / bootstrapping seem to do the trick in many cases.
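The sub-sampled Moran's I idea could be sketched roughly as follows, assuming observation coordinates x and y and a DHARMa object res; this is an untested illustration, not part of this PR:

```r
# Run testSpatialAutocorrelation on repeated random subsets instead of
# the full data set, since Moran's I scales poorly with n.
n <- length(res$scaledResiduals)
pvals <- replicate(20, {
  s <- sample(n, 5000)
  testSpatialAutocorrelation(recalculateResiduals(res, sel = s),
                             x = x[s], y = y[s], plot = FALSE)$p.value
})
# Inspect the distribution of p-values across subsets
summary(pvals)
```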
