
How to reject a model with underdispersion but low AIC #422

Closed
nmueller18 opened this issue Jul 3, 2024 · 7 comments
@nmueller18

First of all: Many thanks for the great package!

I am trying to model tooth loss in human populations before modern dentistry (i.e. extraction at the dentist :-) ). I have an empirical data set of individuals with tooth loss as a binary variable per tooth position (i.e. canines, molars, etc.). The most important predictor is the age of the respective individual.

In the first run of models, disregarding the information on tooth position, the data set is reduced to the number of lost teeth vs. the number of still-observable tooth positions (which varies) per individual.
A simple logistic regression (model 1) with a binomial distribution reveals strong overdispersion:

fit <- glm(cbind(lost_teeth, tooth_positions - lost_teeth) ~ age,
           family = binomial, data = data_tooth_loss)
AIC(fit)
# 2123.05
DHARMa::simulateResiduals(fittedModel = fit, plot = TRUE)

[Figure: DHARMa residual plots for model 1 (logistic_simple)]

If the individual is added as a predictor (model 2), the AIC improves substantially, but the plot now shows strong underdispersion. Strangely, the dispersion test is not significant.

fit <- glm(cbind(lost_teeth, tooth_positions - lost_teeth) ~ age + individual_no,
           family = binomial, data = data_tooth_loss)
AIC(fit)
# 920.4828
simulationOutput <- DHARMa::simulateResiduals(fittedModel = fit, plot = TRUE)
DHARMa::testDispersion(simulationOutput)

	DHARMa nonparametric dispersion test via sd of residuals fitted vs. simulated

data:  simulationOutput
dispersion = 0.95687, p-value = 0.064
alternative hypothesis: two.sided

[Figure: DHARMa residual plots for model 2 (logistic_simple_ind)]

A betabinomial model (model 3) takes care of the overdispersion. I use the function betabin from the package aod for that. Although DHARMa provides no built-in support for aod, thanks to the vignette I was able to write a custom function for creating a DHARMa object. There is no overdispersion anymore, and the residual plot looks fine in my view.

fit_beta <- aod::betabin(cbind(lost_teeth, tooth_positions - lost_teeth) ~ age,
                         ~ 1, data = data_tooth_loss, link = "logit")
AIC(fit_beta)
# 1159.566

simulatebetabin <- function(fitted, n, x) {
  theta_param <- fitted@random.param           # overdispersion parameter phi
  tp   <- unlist(x["tooth_positions"])
  pred <- predict(fitted, type = "response")   # use the function argument, not the global fit
  nObs <- length(pred)
  sim  <- matrix(nrow = nObs, ncol = n)
  for (i in 1:n) sim[, i] <- FlexReg::rBetaBin(nObs, size = tp, mu = pred, theta = theta_param)
  return(sim)
}

sim <- simulatebetabin(fit_beta, 1000, data_tooth_loss)
DHARMaRes <- DHARMa::createDHARMa(simulatedResponse = sim,
                                  observedResponse = data_tooth_loss$lost_teeth,
                                  fittedPredictedResponse = predict(fit_beta, type = "response"),
                                  integerResponse = TRUE)
plot(DHARMaRes, quantreg = TRUE)
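As an aside, the betabinomial draws can also be generated without FlexReg, via the beta-mixture representation of the betabinomial. This is a sketch: the function name rbetabinom is mine, and it assumes phi is an intra-class-correlation-style overdispersion parameter (the role aod's estimated parameter plays), which is worth verifying against FlexReg's theta for a few draws.

```r
# Dependency-free betabinomial sampler via the beta-mixture representation:
# draw p ~ Beta(a, b) with mean mu and Var(p) = mu * (1 - mu) * phi,
# then y ~ Binomial(size, p).
rbetabinom <- function(n, size, mu, phi) {
  a <- mu * (1 - phi) / phi         # shape parameters chosen so that
  b <- (1 - mu) * (1 - phi) / phi   # E[p] = mu and Var(p) = mu * (1 - mu) * phi
  p <- rbeta(n, shape1 = a, shape2 = b)
  rbinom(n, size = size, prob = p)
}

# Quick sanity check: mean close to mu, variance well above the binomial one.
set.seed(1)
y <- rbetabinom(10000, size = 20, mu = 0.3, phi = 0.2)
mean(y) / 20   # close to 0.3
var(y)         # clearly larger than the binomial variance 20 * 0.3 * 0.7 = 4.2
```

Dropping a vector of sizes in for `size` works the same way, so this could replace the FlexReg call inside the custom simulate function if desired.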

[Figure: DHARMa residual plots for model 3 (betabin)]

I now face the problem that model 2 has the lowest AIC, but model 3 appears far more elegant and does not suffer from underdispersion. In the vignette and, for example, here you state that underdispersion is less of an issue than overdispersion, so I am not sure on what grounds I can reject model 2 in favour of model 3. Model 2 seems clearly overfitted judging from the figures, but is there a measure from which to read this result, especially as the dispersion test is not significant?

Any advice is much appreciated!

@melina-leite
Collaborator

Hi @nmueller18, I'm sorry for the late reply. Do you still have questions about this issue? I mean, did you figure out how to conduct your analysis or do you still want to discuss it here?
If everything is OK, I'll close this issue; if not, feel free to reply here, and we will try to help you out.
:)

@nmueller18
Author

Thank you for getting back to me. Actually, I would still be interested in an informed answer. To summarize my problem: how do I decide between two models when the DHARMa output clearly favours model 3, but model 2 has a lower AIC?

@melina-leite
Collaborator

Ok, I'll give it a try to help you out:

The models you presented describe a binomial proportion (lost teeth / (lost + remaining teeth)). So your dataset has ONE observation per individual, right? If so, using individual as a predictor may cause overfitting problems (you are estimating a beta for each individual, and each individual has only ONE data point).
Because of that, I would not use individual as a predictor and would forget about model 2.

About which model to select, I'll quote a recent answer from Florian on issue #442:

Residual checks are meant to inquire whether your model shows some significant misfit.
Residual checks are not a model selection criterion, as they do not account for complexity. Thus, if you want to know whether you should add complexity to a model, use a model selection criterion.

Comparing the AIC of model 1 and model 3, we see that model 3 fits better (lower AIC) and has better residual diagnostics, probably because the betabinomial distribution deals better with the overdispersion in your data.

Does it make sense?
:)

@florianhartig
Owner

I agree with Melina - if you add one predictor per individual, you can fit the mean perfectly (i.e. you are overfitting massively), which likely also creates the underdispersion (overfitting to noise creates underdispersion). Thus, I would say the model with + individual_no just doesn't make sense theoretically, regardless of the AIC.

The overdispersion of your first model likely arises because there are unobserved predictors that also affect the rate of tooth loss. If the true process is binomial but you have additional variation, you will see overdispersion. Therefore, my first question with overdispersion is always: do you have additional meaningful predictors, or can you improve your modelling of the mean?
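This mechanism is easy to reproduce with a toy simulation (entirely synthetic data, invented for illustration; not the tooth-loss data): a truly binomial process with an unobserved individual-level predictor on the logit scale shows up as overdispersion in a plain binomial GLM.

```r
# Synthetic example: binomial counts whose success probability depends on
# age AND an unobserved predictor u; the GLM only sees age.
set.seed(42)
n    <- 500
age  <- runif(n, 20, 80)
u    <- rnorm(n, sd = 1.5)                 # unobserved predictor
p    <- plogis(-4 + 0.05 * age + u)
size <- sample(20:32, n, replace = TRUE)   # observable tooth positions
y    <- rbinom(n, size = size, prob = p)

fit  <- glm(cbind(y, size - y) ~ age, family = binomial)

# Pearson dispersion statistic: ~1 for a well-specified binomial model.
disp <- sum(residuals(fit, type = "pearson")^2) / df.residual(fit)
disp  # well above 1 here, because of the unmodelled variation in u
```

Adding u as a predictor (or switching to a betabinomial) brings the statistic back toward 1.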

The fact that you didn't mention other predictors likely means you don't have them. In that case, the betabinomial is fine. Note that you could also fit it with glmmTMB, which is supported by DHARMa. An alternative is to put an observation-level random effect (1 | individual_no) on the binomial model, see e.g. Harrison 2014 https://peerj.com/articles/616/. However, for various technical reasons, if available, I would prefer the betabinomial, so I would stay with your model 3.
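The glmmTMB route can be sketched as follows (a sketch, assuming glmmTMB and DHARMa are installed; synthetic data stand in for the poster's data_tooth_loss, whose column names are taken from the thread):

```r
library(glmmTMB)

# Synthetic stand-in for data_tooth_loss (columns assumed from the thread).
set.seed(1)
n <- 200
data_tooth_loss <- data.frame(
  age             = runif(n, 20, 80),
  tooth_positions = sample(20:32, n, replace = TRUE)
)
p <- plogis(-4 + 0.05 * data_tooth_loss$age + rnorm(n, sd = 1))
data_tooth_loss$lost_teeth <- rbinom(n, data_tooth_loss$tooth_positions, p)

# Betabinomial fit, equivalent in spirit to model 3 but natively supported by DHARMa:
fit_bb <- glmmTMB(cbind(lost_teeth, tooth_positions - lost_teeth) ~ age,
                  family = betabinomial(link = "logit"),
                  data   = data_tooth_loss)
AIC(fit_bb)

# No custom simulate function needed for glmmTMB objects:
res <- DHARMa::simulateResiduals(fittedModel = fit_bb, plot = TRUE)
```

With glmmTMB, the observation-level random-effect alternative is the same call with family = binomial and + (1 | individual_no) in the formula.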

@nmueller18
Author

@melina-leite and @florianhartig, many thanks for your input! I was also under the impression that adding the individual does not make much sense, but was put off by the lower AIC. End of story: AIC is not everything!
Thank you also for your suggestions on how to improve the model! My final model is actually a little more complicated, and I use JAGS for its flexibility.
I will close this issue for now.

@florianhartig
Owner

Sure, thanks for the feedback - note that JAGS models can also be tested with DHARMa; see https://cran.r-project.org/web/packages/DHARMa/vignettes/DHARMaForBayesians.html

@nmueller18
Author

Thanks to the excellent vignette, I was already able to implement a workflow for DHARMa and JAGS.
