diff --git a/GAMworkshop.Rmd b/GAMworkshop.Rmd index b90ec54..5c0d97f 100644 --- a/GAMworkshop.Rmd +++ b/GAMworkshop.Rmd @@ -71,19 +71,17 @@ label= c("1", "\u03B2", "\u03B1" )) .pull-left[


-

-# Fitting Wiggly Data -### Why `mgcv` is awesome +# Modelling non-linear data with Generalized Additive Models (GAMs) +### Using the `mgcv` package

-

+
`r icons::icon_style(icons::fontawesome("envelope"))` [c.mainey1@nhs.net](mailto:c.mainey1@nhs.net) -`r icons::icon_style(icons::fontawesome("globe"))` [mainard.co.uk](https://www.mainard.co.uk) `r icons::icon_style(icons::fontawesome("github"))` [chrismainey](https://github.com/chrismainey) -`r icons::icon_style(icons::fontawesome("twitter"))` [chrismainey](witter.com/chrismainey) - +`r icons::icon_style(icons::fontawesome("linkedin"), fill = "#005EB8")` [chrismainey](https://www.linkedin.com/in/chrismainey/) +`r icons::icon_style(icons::fontawesome("orcid"), fill = "#005EB8")` [0000-0002-3018-6171](https://orcid.org/0000-0002-3018-6171) ] .pull-right[ @@ -95,10 +93,6 @@ Don't think about it too hard...`r emo::ji("wink")`

] -.qr1[ -QR code https://chrismainey.github.io/fitting_wiggly_data/fitting_wiggly_data.html -] - --- # Regression models on non-linear data @@ -270,7 +264,7 @@ ggplot(dt, aes(y=Y, x=X))+ If we use these in regression, we can get something like: -$$y = a_0 + a_1x + a_2x^2 + ... + a_nx_n$$ +$$y = \alpha + \beta_1x + \beta_2x^2 + \beta_3x^3 + ... + \beta_nx^n$$ ] -- @@ -285,10 +279,55 @@ $$y = a_0 + a_1x + a_2x^2 + ... + a_nx_n$$ + [Runge's phenomenon](https://en.wikipedia.org/wiki/Runge%27s_phenomenon#:~:text=In%20the%20mathematical%20field%20of,set%20of%20equispaced%20interpolation%20points.) -

Runge phenomenon

+

Runge phenomenon

] +--- + +# Degrees of freedom (df) + +Within a model, how many 'parts' are free to vary? + +E.g. if we have 3 numbers and we know the average is 5: + +If we have 2 and 7, our final number is not free to vary. +It must be 6: +$$ \frac{2 + 7 + 6}{3} = 5$$ +This means our 'model' is constrained to $n-1$ degrees of freedom. + + +-- + +## In regression context: + ++ The number of data points in our model ( $N$ ) limits the df ++ Usually the number of predictors ( $k$ ) in our model is considered the df (plus one for the intercept) ++ "Residual df" are the points left to vary in the model, after accounting for the predictors and intercept: + +$$N-k-1$$ +Helpful post on [CrossValidated](https://stats.stackexchange.com/questions/340007/confused-about-residual-degree-of-freedom) + +--- + +# Overfitting + + > When our model fits both the underlying relationship and the 'noise' peculiar to our sample data + + + You want to fit the relationship, whilst minimising the noise. + + + This helps 'generalizability': meaning it will predict well on new data. + + +If we allowed total freedom in our model, e.g. a knot at every data point, what would happen? + +--- +class: middle + +# Exercise 1: Load and fit non-linear relationship +Here we will visualise the relationship, view it as a linear regression, and attempt a polynomial fit. + + --- # What if we could do something more suitable? @@ -328,7 +367,7 @@ Figure taken from Noam Ross' GAMs in R course, CC-BY, https://github.com/noamros .pull-right[ -

Spline (PSF)

+

Spline (PSF)

] @@ -346,6 +385,7 @@ ggplot(dt, aes(y=Y, x=X))+ geom_smooth(aes(col="A"), method = "lm", formula = y~ns(x,10), se=FALSE, size=1.2, show.legend = FALSE) ``` + --- # How smooth? @@ -450,7 +490,6 @@ Where: .smaller[ + $f_i$ are smooth functions of the covariates, $xk$, where $k$ is each function basis.] - --- # What does that mean for me? @@ -458,15 +497,33 @@ + _Hastie (1985)_ used knot every point, _Wood (2017)_ uses reduced-rank version - -- +## Issues + ++ We need to choose the right _dimension_ (degrees of freedom / knots) for our smoothers ++ We need to choose the right penalty ($\lambda$) for our smoothers + +### Consequence + ++ If you penalise a smooth of $k$ dimensions, it no longer has $k-1$ degrees of freedom, as the penalty reduces them ++ 'Effective degrees of freedom' - the penalized df of the predictors in the model. + +__Note:__
+$df(\lambda) = k$, when $\lambda = 0$ +
+$df(\lambda) \rightarrow 0$, when $\lambda \rightarrow \infty$ + + +--- # mgcv: mixed gam computation vehicle + Prof. Simon Wood's package, pretty much the standard + Included in standard `R` distribution, used in `ggplot2` `geom_smooth` etc. ++ Has sensible defaults for dimensions ++ Estimates the ideal penalty for smooths by various methods, with REML recommended. -- @@ -479,6 +536,7 @@ my_gam <- gam(Y ~ s(X, bs="cr"), data=dt) + `s()` controls smoothers (and other options, `t`, `ti`) + `bs="cr"` telling it to use cubic regression spline ('basis') + Default determined from data, but you can alter this e.g. (`k=10`) ++ Penalty (smoothing parameter) estimation method is set with `method="REML"` --- # Model Output: @@ -525,6 +583,37 @@ AIC(my_lm, my_gam) ## Yes, yes it is! +--- +class: middle + +# Exercise 2: Simple GAM fit +We will now fit the same relationship with a GAM using the `mgcv` package. + +--- +class: middle + +# Exercise 3: GAM fitting options +We will now look at varying the fit using things like the degrees of freedom and penalty. +We will also visualise these changes and their effects on models. + +--- +class: middle + +# Exercise 4: Multivariable GAMs +We will now generalise to include more than one predictor / smooth function, how they might be combined, and effects on models. We will also progress on to a generalised linear model, using a distribution family. + +--- +class: middle + +# Break + +--- +class: middle + +# Exercise 5: Put it all together! +We will now apply what we've learnt to the Framingham cohort study, predicting the binary variable 'TenYearCHD' using other columns. See how you can use smoothers on the continuous variables to get the best fit possible. +___Hint:___ you will need to use AIC or AUC to compare models, not $R^2$.
+ --- # Summary @@ -555,8 +644,7 @@ AIC(my_lm, my_gam) # References and Further reading: #### GitHub code: -https://github.com/chrismainey/fitting_wiggly_data - +https://github.com/chrismainey/GAMworkshop #### Simon Wood's comprehensive book: diff --git a/GAMworkshop.html b/GAMworkshop.html index 0607de1..8353474 100644 --- a/GAMworkshop.html +++ b/GAMworkshop.html @@ -4,7 +4,7 @@ GAMworkshop - + @@ -18,19 +18,17 @@ .pull-left[ <br><br><br> -<br><br> -# Fitting Wiggly Data -### Why `mgcv` is awesome +# Modelling non-linear data with Generalized Additive Models (GAMs) +### Using the `mgcv` package <br><br> -<br><br> +<br> <svg viewBox="0 0 512 512" style="height:1em;position:relative;display:inline-block;top:.1em;" xmlns="http://www.w3.org/2000/svg"> <path d="M464 64H48C21.49 64 0 85.49 0 112v288c0 26.51 21.49 48 48 48h416c26.51 0 48-21.49 48-48V112c0-26.51-21.49-48-48-48zm0 48v40.805c-22.422 18.259-58.168 46.651-134.587 106.49-16.841 13.247-50.201 45.072-73.413 44.701-23.208.375-56.579-31.459-73.413-44.701C106.18 199.465 70.425 171.067 48 152.805V112h416zM48 400V214.398c22.914 18.251 55.409 43.862 104.938 82.646 21.857 17.205 60.134 55.186 103.062 54.955 42.717.231 80.509-37.199 103.053-54.947 49.528-38.783 82.032-64.401 104.947-82.653V400H48z"></path></svg> [c.mainey1@nhs.net](mailto:c.mainey1@nhs.net) -<svg viewBox="0 0 496 512" style="height:1em;position:relative;display:inline-block;top:.1em;" xmlns="http://www.w3.org/2000/svg"> <path d="M336.5 160C322 70.7 287.8 8 248 8s-74 62.7-88.5 152h177zM152 256c0 22.2 1.2 43.5 3.3 64h185.3c2.1-20.5 3.3-41.8 3.3-64s-1.2-43.5-3.3-64H155.3c-2.1 20.5-3.3 41.8-3.3 64zm324.7-96c-28.6-67.9-86.5-120.4-158-141.6 24.4 33.8 41.2 84.7 50 141.6h108zM177.2 18.4C105.8 39.6 47.8 92.1 19.3 160h108c8.7-56.9 25.5-107.8 49.9-141.6zM487.4 192H372.7c2.1 21 3.3 42.5 3.3 64s-1.2 43-3.3 64h114.6c5.5-20.5 8.6-41.8 8.6-64s-3.1-43.5-8.5-64zM120 256c0-21.5 1.2-43 3.3-64H8.6C3.2 212.5 0 233.8 0 256s3.2 43.5 8.6 64h114.6c-2-21-3.2-42.5-3.2-64zm39.5 
96c14.5 89.3 48.7 152 88.5 152s74-62.7 88.5-152h-177zm159.3 141.6c71.4-21.2 129.4-73.7 158-141.6h-108c-8.8 56.9-25.6 107.8-50 141.6zM19.3 352c28.6 67.9 86.5 120.4 158 141.6-24.4-33.8-41.2-84.7-50-141.6h-108z"></path></svg> [mainard.co.uk](https://www.mainard.co.uk) <svg viewBox="0 0 496 512" style="height:1em;position:relative;display:inline-block;top:.1em;" xmlns="http://www.w3.org/2000/svg"> <path d="M165.9 397.4c0 2-2.3 3.6-5.2 3.6-3.3.3-5.6-1.3-5.6-3.6 0-2 2.3-3.6 5.2-3.6 3-.3 5.6 1.3 5.6 3.6zm-31.1-4.5c-.7 2 1.3 4.3 4.3 4.9 2.6 1 5.6 0 6.2-2s-1.3-4.3-4.3-5.2c-2.6-.7-5.5.3-6.2 2.3zm44.2-1.7c-2.9.7-4.9 2.6-4.6 4.9.3 2 2.9 3.3 5.9 2.6 2.9-.7 4.9-2.6 4.6-4.6-.3-1.9-3-3.2-5.9-2.9zM244.8 8C106.1 8 0 113.3 0 252c0 110.9 69.8 205.8 169.5 239.2 12.8 2.3 17.3-5.6 17.3-12.1 0-6.2-.3-40.4-.3-61.4 0 0-70 15-84.7-29.8 0 0-11.4-29.1-27.8-36.6 0 0-22.9-15.7 1.6-15.4 0 0 24.9 2 38.6 25.8 21.9 38.6 58.6 27.5 72.9 20.9 2.3-16 8.8-27.1 16-33.7-55.9-6.2-112.3-14.3-112.3-110.5 0-27.5 7.6-41.3 23.6-58.9-2.6-6.5-11.1-33.3 2.6-67.9 20.9-6.5 69 27 69 27 20-5.6 41.5-8.5 62.8-8.5s42.8 2.9 62.8 8.5c0 0 48.1-33.6 69-27 13.7 34.7 5.2 61.4 2.6 67.9 16 17.7 25.8 31.5 25.8 58.9 0 96.5-58.9 104.2-114.8 110.5 9.2 7.9 17 22.9 17 46.4 0 33.7-.3 75.4-.3 83.6 0 6.5 4.6 14.4 17.3 12.1C428.2 457.8 496 362.9 496 252 496 113.3 383.5 8 244.8 8zM97.2 352.9c-1.3 1-1 3.3.7 5.2 1.6 1.6 3.9 2.3 5.2 1 1.3-1 1-3.3-.7-5.2-1.6-1.6-3.9-2.3-5.2-1zm-10.8-8.1c-.7 1.3.3 2.9 2.3 3.9 1.6 1 3.6.7 4.3-.7.7-1.3-.3-2.9-2.3-3.9-2-.6-3.6-.3-4.3.7zm32.4 35.6c-1.6 1.3-1 4.3 1.3 6.2 2.3 2.3 5.2 2.6 6.5 1 1.3-1.3.7-4.3-1.3-6.2-2.2-2.3-5.2-2.6-6.5-1zm-11.4-14.7c-1.6 1-1.6 3.6 0 5.9 1.6 2.3 4.3 3.3 5.6 2.3 1.6-1.3 1.6-3.9 0-6.2-1.4-2.3-4-3.3-5.6-2z"></path></svg> [chrismainey](https://github.com/chrismainey) -<svg viewBox="0 0 512 512" style="height:1em;position:relative;display:inline-block;top:.1em;" xmlns="http://www.w3.org/2000/svg"> <path d="M459.37 151.716c.325 4.548.325 9.097.325 13.645 0 138.72-105.583 298.558-298.558 
298.558-59.452 0-114.68-17.219-161.137-47.106 8.447.974 16.568 1.299 25.34 1.299 49.055 0 94.213-16.568 130.274-44.832-46.132-.975-84.792-31.188-98.112-72.772 6.498.974 12.995 1.624 19.818 1.624 9.421 0 18.843-1.3 27.614-3.573-48.081-9.747-84.143-51.98-84.143-102.985v-1.299c13.969 7.797 30.214 12.67 47.431 13.319-28.264-18.843-46.781-51.005-46.781-87.391 0-19.492 5.197-37.36 14.294-52.954 51.655 63.675 129.3 105.258 216.365 109.807-1.624-7.797-2.599-15.918-2.599-24.04 0-57.828 46.782-104.934 104.934-104.934 30.213 0 57.502 12.67 76.67 33.137 23.715-4.548 46.456-13.32 66.599-25.34-7.798 24.366-24.366 44.833-46.132 57.827 21.117-2.273 41.584-8.122 60.426-16.243-14.292 20.791-32.161 39.308-52.628 54.253z"></path></svg> [chrismainey](witter.com/chrismainey) - +<svg viewBox="0 0 448 512" style="height:1em;position:relative;display:inline-block;top:.1em;fill:#005EB8;" xmlns="http://www.w3.org/2000/svg"> <path d="M416 32H31.9C14.3 32 0 46.5 0 64.3v383.4C0 465.5 14.3 480 31.9 480H416c17.6 0 32-14.5 32-32.3V64.3c0-17.8-14.4-32.3-32-32.3zM135.4 416H69V202.2h66.5V416zm-33.2-243c-21.3 0-38.5-17.3-38.5-38.5S80.9 96 102.2 96c21.2 0 38.5 17.3 38.5 38.5 0 21.3-17.2 38.5-38.5 38.5zm282.1 243h-66.4V312c0-24.8-.5-56.7-34.5-56.7-34.6 0-39.9 27-39.9 54.9V416h-66.4V202.2h63.7v29.2h.9c8.9-16.8 30.6-34.5 62.9-34.5 67.2 0 79.7 44.3 79.7 101.9V416z"></path></svg> [chrismainey](https://www.linkedin.com/in/chrismainey/) +<svg viewBox="0 0 512 512" style="height:1em;position:relative;display:inline-block;top:.1em;fill:#005EB8;" xmlns="http://www.w3.org/2000/svg"> <path d="M294.75 188.19h-45.92V342h47.47c67.62 0 83.12-51.34 83.12-76.91 0-41.64-26.54-76.9-84.67-76.9zM256 8C119 8 8 119 8 256s111 248 248 248 248-111 248-248S393 8 256 8zm-80.79 360.76h-29.84v-207.5h29.84zm-14.92-231.14a19.57 19.57 0 1 1 19.57-19.57 19.64 19.64 0 0 1-19.57 19.57zM300 369h-81V161.26h80.6c76.73 0 110.44 54.83 110.44 103.85C410 318.39 368.38 369 300 369z"></path></svg> 
[0000-0002-3018-6171](https://orcid.org/0000-0002-3018-6171) ] .pull-right[ @@ -42,10 +40,6 @@ ] -.qr1[ -<img src="man/figures/qr1.png" style="height:150px;" alt="QR code https://chrismainey.github.io/fitting_wiggly_data/fitting_wiggly_data.html"> -] - --- # Regression models on non-linear data @@ -116,7 +110,7 @@ If we use these in regression, we can get something like: -`$$y = a_0 + a_1x + a_2x^2 + ... + a_nx_n$$` +`$$y = \alpha + \beta_1x + \beta_2x^2 + \beta_3x^3 + ... + \beta_nx^n$$` ] -- @@ -131,10 +125,55 @@ + [Runge's phenomenon](https://en.wikipedia.org/wiki/Runge%27s_phenomenon#:~:text=In%20the%20mathematical%20field%20of,set%20of%20equispaced%20interpolation%20points.) -<p style="text-align:center;"><a title="Nicoguaro, CC BY 4.0 &lt;https://creativecommons.org/licenses/by/4.0&gt;, via Wikimedia Commons" href="https://commons.wikimedia.org/wiki/File:Runge_phenomenon.svg"><img width="400" alt="Runge phenomenon" src="https://upload.wikimedia.org/wikipedia/commons/thumb/0/0a/Runge_phenomenon.svg/512px-Runge_phenomenon.svg.png"></a></p> +<p style="text-align:center;"><a title="Nicoguaro, CC BY 4.0 &lt;https://creativecommons.org/licenses/by/4.0&gt;, via Wikimedia Commons" href="https://commons.wikimedia.org/wiki/File:Runge_phenomenon.svg"><img width="400" alt="Chart of the shape of a polynomial function, oscillating at the edges as the order of the function increases, demonstrating Runge's phenomenon" src="https://upload.wikimedia.org/wikipedia/commons/thumb/0/0a/Runge_phenomenon.svg/512px-Runge_phenomenon.svg.png"></a></p> ] +--- + +# Degrees of freedom (df) + +Within a model, how many 'parts' are free to vary? + +E.g. if we have 3 numbers and we know the average is 5: + +If we have 2 and 7, our final number is not free to vary. 
+It must be 6: +$$ \frac{2 + 7 + 6}{3} = 5$$ +This means our 'model' is constrained to `\(n-1\)` degrees of freedom. + + +-- + +## In regression context: + ++ The number of data points in our model ( `\(N\)` ) limits the df ++ Usually the number of predictors ( `\(k\)` ) in our model is considered the df (plus one for the intercept) ++ "Residual df" are the points left to vary in the model, after accounting for the predictors and intercept: + +`$$N-k-1$$` +Helpful post on [CrossValidated](https://stats.stackexchange.com/questions/340007/confused-about-residual-degree-of-freedom) + +--- + +# Overfitting + + > When our model fits both the underlying relationship and the 'noise' peculiar to our sample data + + + You want to fit the relationship, whilst minimising the noise. + + + This helps 'generalizability': meaning it will predict well on new data. + + +If we allowed total freedom in our model, e.g. a knot at every data point, what would happen? + +--- +class: middle + +# Exercise 1: Load and fit non-linear relationship +Here we will visualise the relationship, view it as a linear regression, and attempt a polynomial fit. + + --- # What if we could do something more suitable? 
@@ -174,7 +213,7 @@ .pull-right[ -<p style="text-align:center;"><a title="Pearson Scott Foresman, Public domain, via Wikimedia Commons" href="https://commons.wikimedia.org/wiki/File:Spline_(PSF).png"><img width="400" alt="Spline (PSF)" src="https://upload.wikimedia.org/wikipedia/commons/thumb/f/fd/Spline_%28PSF%29.png/512px-Spline_%28PSF%29.png"></a></p> +<p style="text-align:center;"><a title="Pearson Scott Foresman, Public domain, via Wikimedia Commons" href="https://commons.wikimedia.org/wiki/File:Spline_(PSF).png"><img width="400" alt="Drawing of a draftsman bending a flexible piece of wood and using it to draw a smooth curve." src="https://upload.wikimedia.org/wikipedia/commons/thumb/f/fd/Spline_%28PSF%29.png/512px-Spline_%28PSF%29.png"></a></p> ] @@ -187,6 +226,7 @@ <img src="GAMworkshop_files/figure-html/gam1-1.png" alt="Scatter plot from earlier with a wiggly sigmoidal best fit line" width="864" style="display: block; margin: auto;" /> + --- # How smooth? @@ -235,7 +275,6 @@ .smaller[ + `\(f_i\)` are smooth functions of the covariates, `\(xk\)`, where `\(k\)` is each function basis.] - --- # What does that mean for me? @@ -243,15 +282,33 @@ + _Hastie (1985)_ used knot every point, _Wood (2017)_ uses reduced-rank version - -- +## Issues + ++ We need to choose the right _dimension_ (degrees of freedom / knots) for our smoothers ++ We need to choose the right penalty (`\(\lambda\)`) for our smoothers + +### Consequence + ++ If you penalise a smooth of `\(k\)` dimensions, it no longer has `\(k-1\)` degrees of freedom, as the penalty reduces them ++ 'Effective degrees of freedom' - the penalized df of the predictors in the model. + +__Note:__ <br> +`\(df(\lambda) = k\)`, when `\(\lambda = 0\)` +<br> +`\(df(\lambda) \rightarrow 0\)`, when `\(\lambda \rightarrow \infty\)` + + +--- # mgcv: mixed gam computation vehicle + Prof. Simon Wood's package, pretty much the standard + Included in standard `R` distribution, used in `ggplot2` `geom_smooth` etc. 
++ Has sensible defaults for dimensions ++ Estimates the ideal penalty for smooths by various methods, with REML recommended. -- @@ -265,6 +322,7 @@ + `s()` controls smoothers (and other options, `t`, `ti`) + `bs="cr"` telling it to use cubic regression spline ('basis') + Default determined from data, but you can alter this e.g. (`k=10`) ++ Penalty (smoothing parameter) estimation method is set with `method="REML"` --- # Model Output: @@ -384,6 +442,37 @@ ## Yes, yes it is! +--- +class: middle + +# Exercise 2: Simple GAM fit +We will now fit the same relationship with a GAM using the `mgcv` package. + +--- +class: middle + +# Exercise 3: GAM fitting options +We will now look at varying the fit using things like the degrees of freedom and penalty. +We will also visualise these changes and their effects on models. + +--- +class: middle + +# Exercise 4: Multivariable GAMs +We will now generalise to include more than one predictor / smooth function, how they might be combined, and effects on models. We will also progress on to a generalised linear model, using a distribution family. + +--- +class: middle + +# Break + +--- +class: middle + +# Exercise 5: Put it all together! +We will now apply what we've learnt to the Framingham cohort study, predicting the binary variable 'TenYearCHD' using other columns. See how you can use smoothers on the continuous variables to get the best fit possible. +___Hint:___ you will need to use AIC or AUC to compare models, not `\(R^2\)`.
+ --- # Summary @@ -414,8 +503,7 @@ # References and Further reading: #### GitHub code: -https://github.com/chrismainey/fitting_wiggly_data - +https://github.com/chrismainey/GAMworkshop #### Simon Wood's comprehensive book: diff --git a/Solutions/exercise5_CHDRisk.R b/Scripts/Exercise5_CHDRisk_CM_answer.R similarity index 69% rename from Solutions/exercise5_CHDRisk.R rename to Scripts/Exercise5_CHDRisk_CM_answer.R index 727f5fd..e93b1bf 100644 --- a/Solutions/exercise5_CHDRisk.R +++ b/Scripts/Exercise5_CHDRisk_CM_answer.R @@ -61,8 +61,41 @@ CHDrisk2 <- gam(TenYearCHD ~ factor(male) +CHDrisk3 <- gam(TenYearCHD ~ factor(male) + + factor(currentSmoker) + + s(age) + + te(sysBP, diaBP) + + s(glucose, by = factor(diabetes)) + + factor(prevalentStroke) + + factor(prevalentHyp) + + s(totChol) + , data = framingham + , family = "binomial" + , method = "REML") + + +CHDrisk4 <- gam(TenYearCHD ~ factor(male) + + factor(currentSmoker) + + s(age, sysBP) + # + te(age, sysBP) + + s(sysBP) + + s(diaBP) + + s(glucose, by = factor(diabetes)) + + factor(prevalentStroke) + + factor(prevalentHyp) + + s(totChol) + , data = framingham + , family = "binomial" + , method = "REML") + + + + + gam.check(CHDrisk) gam.check(CHDrisk2) +gam.check(CHDrisk3) +gam.check(CHDrisk4) library(gratia) appraise(CHDrisk) @@ -71,7 +104,12 @@ gratia::draw(CHDrisk) ModelMetrics::auc(CHDrisk) summary(CHDrisk) +summary(CHDrisk2) +summary(CHDrisk3) +summary(CHDrisk4) # Change of 4 ~ asymptotic to 95% significant AIC(CHDrisk) AIC(CHDrisk2) +AIC(CHDrisk3) +AIC(CHDrisk4) diff --git a/Solutions/exercise1_osteo_sarcoma.R b/Scripts/exercise1_osteo_sarcoma_poly.R similarity index 83% rename from Solutions/exercise1_osteo_sarcoma.R rename to Scripts/exercise1_osteo_sarcoma_poly.R index d910f94..cb2f660 100644 --- a/Solutions/exercise1_osteo_sarcoma.R +++ b/Scripts/exercise1_osteo_sarcoma_poly.R @@ -20,7 +20,7 @@ sarcoma <- read_csv("./data/sarcoma.csv", name_repair = "universal") -######## Part 1: Linear regression with 
single predictor ########## +###################### Part 1: Linear regression with a single predictor ###################################### # With the Sarcoma data, we will examine the relationship between age and incidence # Pick either the Male or Female reporting category and lets consider the number of Cases @@ -36,11 +36,19 @@ View(sarcoma) library(ggplot2) ggplot(sarcoma, aes(x=Age, y=Male.Cases))+ - geom_col(fill="green2", col=1, alpha=0.5)+ + geom_col(fill="firebrick", col=1, alpha=0.5)+ labs(title="Incidence of Osteosarcoma in UK Males, 2016 - 2018", subtitle="Source: Cancer Research UK")+ theme_minimal()+ theme(plot.subtitle = element_text(face="italic")) +ggplot(sarcoma, aes(x=Age, y=Female.Cases))+ + geom_col(fill="orange2", col=1, alpha=0.5)+ + labs(title="Incidence of Osteosarcoma in UK Females, 2016 - 2018", subtitle="Source: Cancer Research UK")+ + theme_minimal()+ + theme(plot.subtitle = element_text(face="italic")) + + + # Does the number of cases appear related to age? - Some increasing relationship # If so, is it linearly? @@ -50,7 +58,13 @@ ggplot(sarcoma, aes(x=Age, y=Male.Cases))+ # Try the same with the rate to see if population size clouds it. 
ggplot(sarcoma, aes(x=Age, y=Male.Rates))+ - geom_col(fill="dodgerblue2", col=1, alpha=0.5)+ + geom_col(fill="mediumpurple4", col=1, alpha=0.5)+ + labs(title="Incidence Rate of Osteosarcoma in UK Males, 2016 - 2018", subtitle="Source: Cancer Research UK")+ + theme_minimal()+ + theme(plot.subtitle = element_text(face="italic")) + +ggplot(sarcoma, aes(x=Age, y=Female.Rates))+ + geom_col(fill="mediumaquamarine", col=1, alpha=0.5)+ labs(title="Incidence Rate of Osteosarcoma in UK Males, 2016 - 2018", subtitle="Source: Cancer Research UK")+ theme_minimal()+ theme(plot.subtitle = element_text(face="italic")) @@ -64,7 +78,7 @@ ggplot(sarcoma, aes(x=Age, y=Male.Rates))+ -################################################# +################################################################################################################ # Modelling the rate in Males. # Build a linear regression model (lm) to examine it using the summary function @@ -77,7 +91,7 @@ summary(sarcoma_lm) # Let's visualise it using ggplots 'geom_smooth', giving it the method = lm ggplot(sarcoma, aes(x=Age, y=Male.Rates))+ - geom_col(fill="dodgerblue2", col=1, alpha=0.5)+ + geom_col(fill="mediumpurple4", col=1, alpha=0.5)+ geom_smooth(col="red", method = "lm") + labs(title="Incidence Rate of Osteosarcoma in UK Males, 2016 - 2018", subtitle="Source: Cancer Research UK")+ theme_minimal()+ @@ -87,7 +101,8 @@ ggplot(sarcoma, aes(x=Age, y=Male.Rates))+ # Examine this plot. Is the model a good fit? # Is it good for any of the range? -################################################# + +################################################################################################################# # Let's try and do this with a polynomial term or two: Age + Age^2 + Age^3 # You can manually do it, or use the poly() function, which helps reduce correlation @@ -101,7 +116,7 @@ summary(sarcoma_lm_poly) # Visualise this again with ggplot geom_smooth. 
Add the argument: formula = "y ~ naturalSpline(x)", including your options ggplot(sarcoma, aes(x=Age, y=Male.Rates))+ - geom_col(fill="dodgerblue2", col=1, alpha=0.5)+ + geom_col(fill="mediumpurple4", col=1, alpha=0.5)+ geom_smooth(col="red", method = "lm", formula = "y ~ poly(x,degree=4)") + labs(title="Incidence Rate of Osteosarcoma in UK Males, 2016 - 2018", subtitle="Source: Cancer Research UK")+ theme_minimal()+ @@ -139,7 +154,7 @@ summary(sarcoma_lm_spline2) # Visualise this again with ggplot geom_smooth. Add the argument: formula = "y ~ naturalSpline(x)", including your options ggplot(sarcoma, aes(x=Age, y=Male.Rates))+ - geom_col(fill="dodgerblue2", col=1, alpha=0.5)+ + geom_col(fill="mediumpurple4", col=1, alpha=0.5)+ geom_smooth(col="red", method = "lm", formula = "y ~ naturalSpline(x, df=4)") + labs(title="Incidence Rate of Osteosarcoma in UK Males, 2016 - 2018", subtitle="Source: Cancer Research UK")+ theme_minimal()+ @@ -147,7 +162,7 @@ ggplot(sarcoma, aes(x=Age, y=Male.Rates))+ ggplot(sarcoma, aes(x=Age, y=Male.Rates))+ - geom_col(fill="dodgerblue2", col=1, alpha=0.5)+ + geom_col(fill="mediumpurple4", col=1, alpha=0.5)+ geom_smooth(col="red", method = "lm", formula = "y ~ naturalSpline(x, knots= c(24,49,79))") + labs(title="Incidence Rate of Osteosarcoma in UK Males, 2016 - 2018", subtitle="Source: Cancer Research UK")+ theme_minimal()+ diff --git a/Scripts/exercise2_osteo_sarcoma_GAM.R b/Scripts/exercise2_osteo_sarcoma_GAM.R new file mode 100644 index 0000000..e10663a --- /dev/null +++ b/Scripts/exercise2_osteo_sarcoma_GAM.R @@ -0,0 +1,87 @@ +# Here is an example using Osteosarcoma, a type of bone cancer. +# This is based on incidence data, per 100,000 population, 2016-2018, +# Published by Cancer Research UK: +# https://www.cancerresearchuk.org/health-professional/cancer-statistics/statistics-by-cancer-type/bone-sarcoma/incidence +# The data have been slightly regrouped to aid the exercise. 
+ +# Load the data: + +library(readr) +library(ggplot2) +sarcoma <- read_csv("./data/sarcoma.csv", name_repair = "universal") + +# Columns in the data are: +# Age: Upper number of 5-year age bands. e.g. 0 - 4, is coded as 4. 90+ is coded as 95 +# for the sake of this exercise. +# Female.Cases: Number of incident cases reported in females. +# Male.Cases: Number of incident cases reported in males. +# Female.Rates: Incidence Rate per 100,000 population, in this age group, for females +# Male.Rates: Incidence Rate per 100,000 population, in this age group, for males + + + + +############################################################ + +# Now let's fit it as a GAM using the mgcv package. +# Use the 's()' function to wrap a smoother around age. We'll let it choose for us at first. + +library(mgcv) +sarcoma_gam1 <- gam(Male.Rates ~ s(Age), data=sarcoma) + +summary(sarcoma_gam1) + + +# Plot the results +plot(sarcoma_gam1, residuals = TRUE, pch = 1) + + +#### Can also use alternatives like the `gratia` or `mgcViz` packages: + +# https://gavinsimpson.github.io/gratia/index.html +gratia::draw(sarcoma_gam1, residuals = TRUE) + +library(mgcViz) +# https://mfasiolo.github.io/mgcViz/articles/mgcviz.html +b <- mgcViz::getViz(sarcoma_gam1) + +o <- plot( sm(b, 1) ) +o + l_fitLine(colour = "red") + l_rug(mapping = aes(x=x, y=y), alpha = 0.8) + + l_ciLine(mul = 5, colour = "blue", linetype = 2) + + l_points(shape = 19, size = 1, alpha = 0.1) + theme_classic() + + + +# Extract the model coefficients with the `coef` function. How many coefficients are there for basis functions that make up this smooth? +coef(sarcoma_gam1) + + +# Can you interpret any of those coefficients? +# How do you know if they are 'significant'? + +summary(sarcoma_gam1) +anova(sarcoma_gam1) + +# Both show the pooled significance of the whole smoother. 
+ +# Predict and fit manually: +# We'll create a column called 'gam1_preds' for the predictions in our original data frame + +sarcoma$gam1_preds <- predict(sarcoma_gam1, newdata = sarcoma) + +# Also compare to predictions from our original models + +sarcoma$lm_preds <- predict(sarcoma_lm, newdata = sarcoma) +sarcoma$lm_poly_preds <- predict(sarcoma_lm_poly, newdata = sarcoma) + +ggplot(sarcoma, aes(y=Male.Rates))+ + geom_point(aes(x=gam1_preds), col="mediumpurple")+ + geom_point(aes(x=lm_preds), col="mediumaquamarine")+ + geom_point(aes(x=lm_poly_preds), col="orange2")+ + geom_abline(intercept=0, slope=1, col="red") + +# In general, purple points (gam) are closer to the line (perfect prediction) compared to the regular lm. +# Let's sum up the residual error, squaring it so +/- don't cancel. +sum(residuals(sarcoma_lm)^2) +sum(residuals(sarcoma_lm_poly)^2) +sum(residuals(sarcoma_gam1)^2) diff --git a/Solutions/exercise2_osteo_sarcoma.R b/Scripts/exercise3_osteo_sarcoma_gam_options.R similarity index 96% rename from Solutions/exercise2_osteo_sarcoma.R rename to Scripts/exercise3_osteo_sarcoma_gam_options.R index fecec40..b74b8b8 100644 --- a/Solutions/exercise2_osteo_sarcoma.R +++ b/Scripts/exercise3_osteo_sarcoma_gam_options.R @@ -85,7 +85,9 @@ plot(sarcoma_gam8, residuals = TRUE, pch = 1) plot(sarcoma_gam9, residuals = TRUE, pch = 1) # The higher penalties reduce the 'wiggliness' - +# Extreme: +sarcoma_gam7a <- gam(Male.Rates ~ s(Age, bs="cr", k=12), data=sarcoma, sp=1e12) +plot(sarcoma_gam7a, residuals = TRUE, pch = 1) ############################################################ # Model fitting and diagnostics @@ -119,3 +121,5 @@ check.gamViz(getViz(sarcoma_gam10)) # mgcv has functions already for some of diagnostics qq.gam(sarcoma_gam10, pch=1) + + diff --git a/Solutions/exercise3_sarcoma_and_los.R b/Scripts/exercise4_sarcoma_and_los_multiplevariable.R similarity index 99% rename from Solutions/exercise3_sarcoma_and_los.R rename to 
Scripts/exercise4_sarcoma_and_los_multiplevariable.R index b6d66ee..4f4ba73 100644 --- a/Solutions/exercise3_sarcoma_and_los.R +++ b/Scripts/exercise4_sarcoma_and_los_multiplevariable.R @@ -176,4 +176,6 @@ concurvity(gam_death2, full = TRUE) concurvity(gam_death2, full = FALSE) +# Is there a problem with concurvity here? + diff --git a/Scripts/exercise5_CHDRisk.R b/Scripts/exercise5_CHDRisk.R new file mode 100644 index 0000000..7aca176 --- /dev/null +++ b/Scripts/exercise5_CHDRisk.R @@ -0,0 +1,39 @@ +### Examining Framingham data + +# The Framingham Heart Study is a long-standing cohort study that is responsible for +# much of what we know about heart diseases and associated conditions. For more info, see: +# https://en.wikipedia.org/wiki/Framingham_Heart_Study + +# We are going to use GAMs to investigate some of it: + +framingham <- read.csv("./data/framingham.csv") + +# The columns we have are: +# male 0 = Female; 1 = Male +# age Age at examination time. +# education 1 = Some High School; 2 = High School or GED; +# 3 = Some College or Vocational School; 4 = college +# currentSmoker 0 = nonsmoker; 1 = smoker +# cigsPerDay number of cigarettes smoked per day (estimated average) +# BPMeds 0 = Not on Blood Pressure medications; 1 = Is on Blood Pressure medications +# prevalentStroke +# prevalentHyp +# diabetes 0 = No; 1 = Yes +# totChol mg/dL +# sysBP mmHg +# diaBP mmHg +# BMI Body Mass Index calculated as: Weight (kg) / Height(meter-squared) +# heartRate Beats/Min (Ventricular) +# glucose mg/dL +# TenYearCHD 0 = No CHD at 10-year follow-up; 1 = CHD at 10-year follow-up + + +# Try and build the best regression model you can: + +library(mgcv) + +CHDrisk <- gam(TenYearCHD ~ # enter your model terms here + , data = framingham + , family = "binomial" + , method = "REML") + diff --git a/xaringan-themer.css b/xaringan-themer.css index 44d149c..fa0c614 100644 --- a/xaringan-themer.css +++ b/xaringan-themer.css @@ -159,7 +159,9 @@ img, video, iframe { } blockquote { 
border-left: solid 5px #23395b80; + background-color: #e0e0e0; padding-left: 1em; + padding: 0.05rem 0.05rem 0.05rem 1em; } .remark-slide table { margin: auto;