Create a first version of the model #95

Open
Tracked by #91
emmamendelsohn opened this issue Jun 28, 2024 · 1 comment

@emmamendelsohn
Collaborator

emmamendelsohn commented Jun 28, 2024

Current status (2024-06-28): we have a workflow for model splitting and fitting using tidymodels. There is some commented-out code to create Ceteris Paribus profiles (https://github.com/ecohealthalliance/open-rvfcast/blob/feature/outbreak-layer/_targets.R#L575-L623). I think this code works.

We still need to set up a target that selects the best parameters after cross-validation. This should be doable through tidymodels. Then we need to fit the final version of the model.
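As a rough, language-agnostic sketch of what that target needs to do (not the tidymodels API itself; I believe `select_best()` and `finalize_workflow()` cover this in tidymodels), the logic is: average the metric over folds for each candidate and keep the best one. The parameter names, the `roc_auc` metric, and the numbers below are made up for illustration.

```python
from collections import defaultdict
from statistics import mean


def select_best(cv_results, metric="roc_auc", maximize=True):
    """Return (params, mean_score) for the candidate with the best mean metric.

    cv_results is a list of dicts, one per (candidate, fold) pair, each with a
    "params" dict and a value for `metric`. This layout is hypothetical.
    """
    by_params = defaultdict(list)
    for row in cv_results:
        # Sort the items so the same parameter set always maps to the same key.
        key = tuple(sorted(row["params"].items()))
        by_params[key].append(row[metric])

    means = {key: mean(vals) for key, vals in by_params.items()}
    chooser = max if maximize else min
    best_key = chooser(means, key=means.get)
    return dict(best_key), means[best_key]


# Illustrative (made-up) cross-validation results: two candidates, two folds.
cv_results = [
    {"params": {"tree_depth": 3, "learn_rate": 0.1}, "fold": 1, "roc_auc": 0.71},
    {"params": {"tree_depth": 3, "learn_rate": 0.1}, "fold": 2, "roc_auc": 0.69},
    {"params": {"tree_depth": 6, "learn_rate": 0.1}, "fold": 1, "roc_auc": 0.74},
    {"params": {"tree_depth": 6, "learn_rate": 0.1}, "fold": 2, "roc_auc": 0.72},
]

best_params, best_score = select_best(cv_results)
```

The final model would then be refit on the full training set with `best_params` and evaluated once on the held-out test split.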

Something to look into: it's unclear whether tidymodels passes the interaction constraints through to the xgboost call (https://github.com/ecohealthalliance/open-rvfcast/blob/feature/outbreak-layer/R/model_specs.R#L25). You can potentially check this by extracting the model object from tidymodels and inspecting it. Otherwise you can look at the Ceteris Paribus plots: the profile lines for area, which is the variable that has the constraint on it, should be fully parallel. If the constraint is not working as expected, you may need to lift the workflow out of tidymodels.
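The "parallel lines" check could also be done numerically rather than by eye. Below is a minimal sketch, assuming the profiles have been exported as predicted probabilities on a shared grid of area values (the data here are synthetic, and the function name is made up). One caveat: with a logistic objective the model is additive on the log-odds scale, so the lines should be parallel after a logit transform and may not look exactly parallel when plotted as probabilities.

```python
import math


def logit(p):
    return math.log(p / (1.0 - p))


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def profiles_are_parallel(profiles, tol=1e-6, on_logit_scale=True):
    """Check whether every profile is a vertical shift of the first one.

    profiles: list of equal-length lists of predicted probabilities, one list
    per observation, all evaluated on the same grid of area values.
    """
    if on_logit_scale:
        profiles = [[logit(p) for p in prof] for prof in profiles]
    reference = profiles[0]
    for prof in profiles[1:]:
        gaps = [a - b for a, b in zip(prof, reference)]
        # Parallel lines have a constant gap along the whole grid.
        if max(gaps) - min(gaps) > tol:
            return False
    return True


# Synthetic profiles: an area effect on a 4-point grid, three observations.
area_effect = [-1.0, -0.5, 0.2, 0.9]
other_effects = [-2.0, 0.0, 1.5]

# Constraint honored: log-odds = g(area) + h(other variables).
additive = [[sigmoid(g + h) for g in area_effect] for h in other_effects]

# Constraint violated: area interacts with the other variables.
interacting = [
    [sigmoid(g + h + 0.5 * g * h) for g in area_effect] for h in other_effects
]

additive_ok = profiles_are_parallel(additive)
interacting_ok = profiles_are_parallel(interacting)
```

If the real profiles fail this check on the log-odds scale, that would suggest the constraint is not reaching xgboost.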

As a conceptual note, we're including the interaction constraint to prevent area from interacting with other variables, as a way to normalize results to polygon area. TBH, I'm struggling with the logic behind this. To me, it seems like splitting on area still enforces the relationship that greater area -> greater outbreak probability? Or perhaps the idea is that, because the area splits are independent of the other variables, the model basically generates predictions for every "level" (as defined by the splits) of area?

Below are some notes on addressing the rarity of first outbreaks. WAHIS includes the first outbreak point and subsequent outbreaks that are part of the same event. We have discussed the following ways to handle this, but I don't think it's an immediate priority.

  • Need to code the first outbreak in a thread versus subsequent outbreaks
  • Stratify train/test and blocking based on first and subsequent events. Evaluate model performance for each.
  • To tune performance for first events: upweight new events in the data and/or write a custom evaluation function (weighted logistic error)
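To make the first and third bullets concrete, here is a small sketch, assuming each record carries an event identifier and a date. The field names `event_id` and `date`, the weights, and the example values are all hypothetical, not the WAHIS schema. It flags the earliest record of each event as the first outbreak, then computes a weighted logistic error in which first outbreaks count more.

```python
import math


def flag_first_outbreaks(records):
    """Add is_first=True to the earliest record of each event.

    Dates are ISO strings, so string comparison orders them correctly.
    Records tied for the earliest date within an event are all flagged.
    """
    earliest = {}
    for rec in records:
        eid = rec["event_id"]
        if eid not in earliest or rec["date"] < earliest[eid]:
            earliest[eid] = rec["date"]
    return [
        dict(rec, is_first=(rec["date"] == earliest[rec["event_id"]]))
        for rec in records
    ]


def weighted_log_loss(y_true, p_pred, weights, eps=1e-15):
    """Weighted mean of the per-observation logistic (log) loss."""
    total = 0.0
    for y, p, w in zip(y_true, p_pred, weights):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        total += -w * (y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / sum(weights)


# Made-up records: event "A" has a first outbreak and one follow-up.
records = [
    {"event_id": "A", "date": "2020-01-10"},
    {"event_id": "A", "date": "2020-01-03"},
    {"event_id": "B", "date": "2020-02-01"},
]
flagged = flag_first_outbreaks(records)

# A badly predicted positive (p = 0.2) and a well predicted negative (p = 0.1).
y_true = [1, 0]
p_pred = [0.2, 0.1]
unweighted = weighted_log_loss(y_true, p_pred, [1.0, 1.0])
upweighted = weighted_log_loss(y_true, p_pred, [5.0, 1.0])  # positive counts 5x
```

Upweighting makes a missed first outbreak cost more, so `upweighted` is larger than `unweighted` here. The same `is_first` flag could drive the stratified train/test split in the second bullet.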

Relevant papers on spatial models.

@emmamendelsohn emmamendelsohn changed the title Complete model training Model fitting Jun 28, 2024
@emmamendelsohn emmamendelsohn changed the title Model fitting Create a first version on the model Jun 28, 2024
@emmamendelsohn emmamendelsohn changed the title Create a first version on the model Create a first version of the model Jun 28, 2024
@n8layman
Collaborator

So by specifying area in the interaction constraints, we are forcing xgboost to split either on area alone or on a mix of the other explanatory variables, never on both together. That then means that the influence of all the other variables is independent of area, right? That seems kind of cool.
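One way to write this down, assuming xgboost honors the constraint, area sits in its own constraint group, and the objective is logistic: every tree then uses either area only or the other variables only, so the summed model is additive on the log-odds scale. Here $g$, $h$, and $x$ (the non-area predictors) are notation introduced for illustration.

```latex
\operatorname{logit} \Pr(\text{outbreak} \mid \text{area}, x) = g(\text{area}) + h(x)

% The effect of changing the other variables is the same at every area:
\operatorname{logit} \Pr(\text{outbreak} \mid \text{area}, x_1)
  - \operatorname{logit} \Pr(\text{outbreak} \mid \text{area}, x_2)
  = h(x_1) - h(x_2)
```

Under this reading, the constraint by itself does not force $g$ to be increasing. Whether greater area maps to greater outbreak probability would depend on what the data support, unless a monotonicity constraint is added separately.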
