diff --git a/GAMworkshop.Rmd b/GAMworkshop.Rmd index b90ec54..5c0d97f 100644 --- a/GAMworkshop.Rmd +++ b/GAMworkshop.Rmd @@ -71,19 +71,17 @@ label= c("1", "\u03B2", "\u03B1" )) .pull-left[


-

-# Fitting Wiggly Data -### Why `mgcv` is awesome +# Modelling non-linear data with Generalized Additive Models (GAMs) +### Using the `mgcv` package

-

+
`r icons::icon_style(icons::fontawesome("envelope"))` [c.mainey1@nhs.net](mailto:c.mainey1@nhs.net) -`r icons::icon_style(icons::fontawesome("globe"))` [mainard.co.uk](https://www.mainard.co.uk) `r icons::icon_style(icons::fontawesome("github"))` [chrismainey](https://github.com/chrismainey) -`r icons::icon_style(icons::fontawesome("twitter"))` [chrismainey](witter.com/chrismainey) - +`r icons::icon_style(icons::fontawesome("linkedin"), fill = "#005EB8")` [chrismainey](https://www.linkedin.com/in/chrismainey/) +`r icons::icon_style(icons::fontawesome("orcid"), fill = "#005EB8")` [0000-0002-3018-6171](https://orcid.org/0000-0002-3018-6171) ] .pull-right[ @@ -95,10 +93,6 @@ Don't think about it too hard...`r emo::ji("wink")`

] -.qr1[ -QR code https://chrismainey.github.io/fitting_wiggly_data/fitting_wiggly_data.html -] - --- # Regression models on non-linear data @@ -270,7 +264,7 @@ ggplot(dt, aes(y=Y, x=X))+ If we use these in regression, we can get something like: -$$y = a_0 + a_1x + a_2x^2 + ... + a_nx_n$$ +$$y = \alpha + \beta_1x + \beta_2x^2 + \beta_3x^3 + ... + \beta_nx^n$$ ] -- @@ -285,10 +279,55 @@ $$y = a_0 + a_1x + a_2x^2 + ... + a_nx_n$$ + [Runge's phenomenon](https://en.wikipedia.org/wiki/Runge%27s_phenomenon#:~:text=In%20the%20mathematical%20field%20of,set%20of%20equispaced%20interpolation%20points.) -

Runge phenomenon

+

Runge phenomenon

] +--- + +# Degrees of freedom (df) + +Within a model, how many 'parts' are free to vary? + +E.g. if we have 3 numbers and we know the average is 5: + +If we have 2 and 7, our final number is not free to vary. +It must be 6: +$$ \frac{2 + 7 + 6}{3} = 5$$ +This means our 'model' is constrained to $n-1$ degrees of freedom. + + +-- + +## In regression context: + ++ The number of data points in our model ( $N$ ) limits the df ++ Usually the number of predictors ( $k$ ) in our model is considered the df (plus one for the intercept) ++ "Residual df" are the points left to vary in the model, after accounting for the predictors and intercept: + +$$N-k-1$$ +Helpful post on [CrossValidated](https://stats.stackexchange.com/questions/340007/confused-about-residual-degree-of-freedom) + +--- + +# Overfitting + + > When our model fits both the underlying relationship and the 'noise' peculiar to our sample data + + + You want to fit the relationship, whilst minimising the noise. + + + This helps 'generalizability': meaning it will predict well on new data. + + +If we allowed total freedom in our model, e.g. a knot at every data point, what would happen? + +--- +class: middle + +# Exercise 1: Load and fit non-linear relationship +Here we will visualise the relationship, view it as a linear regression, and attempt a polynomial fit. + + --- # What if we could do something more suitable? @@ -328,7 +367,7 @@ Figure taken from Noam Ross' GAMs in R course, CC-BY, https://github.com/noamros .pull-right[ -

Spline (PSF)

+

Spline (PSF)

] @@ -346,6 +385,7 @@ ggplot(dt, aes(y=Y, x=X))+ geom_smooth(aes(col="A"), method = "lm", formula = y~ns(x,10), se=FALSE, size=1.2, show.legend = FALSE) ``` + --- # How smooth? @@ -450,7 +490,6 @@ Where: .smaller[ + $f_i$ are smooth functions of the covariates, $xk$, where $k$ is each function basis.] - --- # What does that mean for me? @@ -458,15 +497,33 @@ + _Hastie (1985)_ used knot every point, _Wood (2017)_ uses reduced-rank version - -- +## Issues + ++ We need to choose the right _dimension_ (degrees of freedom / knots) for our smoothers ++ We need to choose the right penalty ($\lambda$) for our smoothers + +### Consequence + ++ If you penalise a smooth of $k$ dimensions, it no longer has $k-1$ degrees of freedom, as the penalty reduces them ++ 'Effective degrees of freedom' - the penalized df of the predictors in the model. + +__Note:__
+$df(\lambda) = k$, when $\lambda = 0$ +
+$df(\lambda) \rightarrow 0$, when $\lambda \rightarrow \infty$ + + +--- # mgcv: mixed gam computation vehicle + Prof. Simon Wood's package, pretty much the standard + Included in standard `R` distribution, used in `ggplot2` `geom_smooth` etc. ++ Has sensible defaults for dimensions ++ Estimates the ideal penalty for smooths by various methods, with REML recommended. -- @@ -479,6 +536,7 @@ my_gam <- gam(Y ~ s(X, bs="cr"), data=dt) + `s()` controls smoothers (and other options, `t`, `ti`) + `bs="cr"` telling it to use cubic regression spline ('basis') + Default determined from data, but you can alter this e.g. (`k=10`) ++ Penalty (smoothing parameter) estimation method is set with `method="REML"` --- # Model Output: @@ -525,6 +583,37 @@ AIC(my_lm, my_gam) ## Yes, yes it is! +--- +class: middle + +# Exercise 2: Simple GAM fit +We will now fit the same relationship with a GAM using the `mgcv` package. + +--- +class: middle + +# Exercise 3: GAM fitting options +We will now look at varying the fit using things like the degrees of freedom and penalty. +We will also visualise these changes and their effects on models. + +--- +class: middle + +# Exercise 4: Multivariable GAMs +We will now generalise to include more than one predictor / smooth function, how they might be combined, and effects on models. We will also progress on to a generalised linear model, using a distribution family. + +--- +class: middle + +# Break + +--- +class: middle + +# Exercise 5: Put it all together! +We will now apply what we've learnt to the Framingham cohort study, predicting the binary variable 'TenYearCHD' using other columns. See how you can use smoothers on the continuous variables to get the best fit possible. +___Hint:___ you will need to use AIC or AUC to compare models, not $R^2$.
+ --- # Summary @@ -555,8 +644,7 @@ AIC(my_lm, my_gam) # References and Further reading: #### GitHub code: -https://github.com/chrismainey/fitting_wiggly_data - +https://github.com/chrismainey/GAMworkshop #### Simon Wood's comprehensive book: diff --git a/GAMworkshop.html b/GAMworkshop.html index 0607de1..8353474 100644 --- a/GAMworkshop.html +++ b/GAMworkshop.html @@ -4,7 +4,7 @@ GAMworkshop - + @@ -18,19 +18,17 @@ .pull-left[ <br><br><br> -<br><br> -# Fitting Wiggly Data -### Why `mgcv` is awesome +# Modelling non-linear data with Generalized Additive Models (GAMs) +### Using the `mgcv` package <br><br> -<br><br> +<br> <svg viewBox="0 0 512 512" style="height:1em;position:relative;display:inline-block;top:.1em;" xmlns="http://www.w3.org/2000/svg"> <path d="M464 64H48C21.49 64 0 85.49 0 112v288c0 26.51 21.49 48 48 48h416c26.51 0 48-21.49 48-48V112c0-26.51-21.49-48-48-48zm0 48v40.805c-22.422 18.259-58.168 46.651-134.587 106.49-16.841 13.247-50.201 45.072-73.413 44.701-23.208.375-56.579-31.459-73.413-44.701C106.18 199.465 70.425 171.067 48 152.805V112h416zM48 400V214.398c22.914 18.251 55.409 43.862 104.938 82.646 21.857 17.205 60.134 55.186 103.062 54.955 42.717.231 80.509-37.199 103.053-54.947 49.528-38.783 82.032-64.401 104.947-82.653V400H48z"></path></svg> [c.mainey1@nhs.net](mailto:c.mainey1@nhs.net) -<svg viewBox="0 0 496 512" style="height:1em;position:relative;display:inline-block;top:.1em;" xmlns="http://www.w3.org/2000/svg"> <path d="M336.5 160C322 70.7 287.8 8 248 8s-74 62.7-88.5 152h177zM152 256c0 22.2 1.2 43.5 3.3 64h185.3c2.1-20.5 3.3-41.8 3.3-64s-1.2-43.5-3.3-64H155.3c-2.1 20.5-3.3 41.8-3.3 64zm324.7-96c-28.6-67.9-86.5-120.4-158-141.6 24.4 33.8 41.2 84.7 50 141.6h108zM177.2 18.4C105.8 39.6 47.8 92.1 19.3 160h108c8.7-56.9 25.5-107.8 49.9-141.6zM487.4 192H372.7c2.1 21 3.3 42.5 3.3 64s-1.2 43-3.3 64h114.6c5.5-20.5 8.6-41.8 8.6-64s-3.1-43.5-8.5-64zM120 256c0-21.5 1.2-43 3.3-64H8.6C3.2 212.5 0 233.8 0 256s3.2 43.5 8.6 64h114.6c-2-21-3.2-42.5-3.2-64zm39.5 
96c14.5 89.3 48.7 152 88.5 152s74-62.7 88.5-152h-177zm159.3 141.6c71.4-21.2 129.4-73.7 158-141.6h-108c-8.8 56.9-25.6 107.8-50 141.6zM19.3 352c28.6 67.9 86.5 120.4 158 141.6-24.4-33.8-41.2-84.7-50-141.6h-108z"></path></svg> [mainard.co.uk](https://www.mainard.co.uk) <svg viewBox="0 0 496 512" style="height:1em;position:relative;display:inline-block;top:.1em;" xmlns="http://www.w3.org/2000/svg"> <path d="M165.9 397.4c0 2-2.3 3.6-5.2 3.6-3.3.3-5.6-1.3-5.6-3.6 0-2 2.3-3.6 5.2-3.6 3-.3 5.6 1.3 5.6 3.6zm-31.1-4.5c-.7 2 1.3 4.3 4.3 4.9 2.6 1 5.6 0 6.2-2s-1.3-4.3-4.3-5.2c-2.6-.7-5.5.3-6.2 2.3zm44.2-1.7c-2.9.7-4.9 2.6-4.6 4.9.3 2 2.9 3.3 5.9 2.6 2.9-.7 4.9-2.6 4.6-4.6-.3-1.9-3-3.2-5.9-2.9zM244.8 8C106.1 8 0 113.3 0 252c0 110.9 69.8 205.8 169.5 239.2 12.8 2.3 17.3-5.6 17.3-12.1 0-6.2-.3-40.4-.3-61.4 0 0-70 15-84.7-29.8 0 0-11.4-29.1-27.8-36.6 0 0-22.9-15.7 1.6-15.4 0 0 24.9 2 38.6 25.8 21.9 38.6 58.6 27.5 72.9 20.9 2.3-16 8.8-27.1 16-33.7-55.9-6.2-112.3-14.3-112.3-110.5 0-27.5 7.6-41.3 23.6-58.9-2.6-6.5-11.1-33.3 2.6-67.9 20.9-6.5 69 27 69 27 20-5.6 41.5-8.5 62.8-8.5s42.8 2.9 62.8 8.5c0 0 48.1-33.6 69-27 13.7 34.7 5.2 61.4 2.6 67.9 16 17.7 25.8 31.5 25.8 58.9 0 96.5-58.9 104.2-114.8 110.5 9.2 7.9 17 22.9 17 46.4 0 33.7-.3 75.4-.3 83.6 0 6.5 4.6 14.4 17.3 12.1C428.2 457.8 496 362.9 496 252 496 113.3 383.5 8 244.8 8zM97.2 352.9c-1.3 1-1 3.3.7 5.2 1.6 1.6 3.9 2.3 5.2 1 1.3-1 1-3.3-.7-5.2-1.6-1.6-3.9-2.3-5.2-1zm-10.8-8.1c-.7 1.3.3 2.9 2.3 3.9 1.6 1 3.6.7 4.3-.7.7-1.3-.3-2.9-2.3-3.9-2-.6-3.6-.3-4.3.7zm32.4 35.6c-1.6 1.3-1 4.3 1.3 6.2 2.3 2.3 5.2 2.6 6.5 1 1.3-1.3.7-4.3-1.3-6.2-2.2-2.3-5.2-2.6-6.5-1zm-11.4-14.7c-1.6 1-1.6 3.6 0 5.9 1.6 2.3 4.3 3.3 5.6 2.3 1.6-1.3 1.6-3.9 0-6.2-1.4-2.3-4-3.3-5.6-2z"></path></svg> [chrismainey](https://github.com/chrismainey) -<svg viewBox="0 0 512 512" style="height:1em;position:relative;display:inline-block;top:.1em;" xmlns="http://www.w3.org/2000/svg"> <path d="M459.37 151.716c.325 4.548.325 9.097.325 13.645 0 138.72-105.583 298.558-298.558 
298.558-59.452 0-114.68-17.219-161.137-47.106 8.447.974 16.568 1.299 25.34 1.299 49.055 0 94.213-16.568 130.274-44.832-46.132-.975-84.792-31.188-98.112-72.772 6.498.974 12.995 1.624 19.818 1.624 9.421 0 18.843-1.3 27.614-3.573-48.081-9.747-84.143-51.98-84.143-102.985v-1.299c13.969 7.797 30.214 12.67 47.431 13.319-28.264-18.843-46.781-51.005-46.781-87.391 0-19.492 5.197-37.36 14.294-52.954 51.655 63.675 129.3 105.258 216.365 109.807-1.624-7.797-2.599-15.918-2.599-24.04 0-57.828 46.782-104.934 104.934-104.934 30.213 0 57.502 12.67 76.67 33.137 23.715-4.548 46.456-13.32 66.599-25.34-7.798 24.366-24.366 44.833-46.132 57.827 21.117-2.273 41.584-8.122 60.426-16.243-14.292 20.791-32.161 39.308-52.628 54.253z"></path></svg> [chrismainey](witter.com/chrismainey) - +<svg viewBox="0 0 448 512" style="height:1em;position:relative;display:inline-block;top:.1em;fill:#005EB8;" xmlns="http://www.w3.org/2000/svg"> <path d="M416 32H31.9C14.3 32 0 46.5 0 64.3v383.4C0 465.5 14.3 480 31.9 480H416c17.6 0 32-14.5 32-32.3V64.3c0-17.8-14.4-32.3-32-32.3zM135.4 416H69V202.2h66.5V416zm-33.2-243c-21.3 0-38.5-17.3-38.5-38.5S80.9 96 102.2 96c21.2 0 38.5 17.3 38.5 38.5 0 21.3-17.2 38.5-38.5 38.5zm282.1 243h-66.4V312c0-24.8-.5-56.7-34.5-56.7-34.6 0-39.9 27-39.9 54.9V416h-66.4V202.2h63.7v29.2h.9c8.9-16.8 30.6-34.5 62.9-34.5 67.2 0 79.7 44.3 79.7 101.9V416z"></path></svg> [chrismainey](https://www.linkedin.com/in/chrismainey/) +<svg viewBox="0 0 512 512" style="height:1em;position:relative;display:inline-block;top:.1em;fill:#005EB8;" xmlns="http://www.w3.org/2000/svg"> <path d="M294.75 188.19h-45.92V342h47.47c67.62 0 83.12-51.34 83.12-76.91 0-41.64-26.54-76.9-84.67-76.9zM256 8C119 8 8 119 8 256s111 248 248 248 248-111 248-248S393 8 256 8zm-80.79 360.76h-29.84v-207.5h29.84zm-14.92-231.14a19.57 19.57 0 1 1 19.57-19.57 19.64 19.64 0 0 1-19.57 19.57zM300 369h-81V161.26h80.6c76.73 0 110.44 54.83 110.44 103.85C410 318.39 368.38 369 300 369z"></path></svg> 
[0000-0002-3018-6171](https://orcid.org/0000-0002-3018-6171) ] .pull-right[ @@ -42,10 +40,6 @@ ] -.qr1[ -<img src="man/figures/qr1.png" style="height:150px;" alt="QR code https://chrismainey.github.io/fitting_wiggly_data/fitting_wiggly_data.html"> -] - --- # Regression models on non-linear data @@ -116,7 +110,7 @@ If we use these in regression, we can get something like: -`$$y = a_0 + a_1x + a_2x^2 + ... + a_nx_n$$` +`$$y = \alpha + \beta_1x + \beta_2x^2 + \beta_3x^3 + ... + \beta_nx^n$$` ] -- @@ -131,10 +125,55 @@ + [Runge's phenomenon](https://en.wikipedia.org/wiki/Runge%27s_phenomenon#:~:text=In%20the%20mathematical%20field%20of,set%20of%20equispaced%20interpolation%20points.) -<p style="text-align:center;"><a title="Nicoguaro, CC BY 4.0 &lt;https://creativecommons.org/licenses/by/4.0&gt;, via Wikimedia Commons" href="https://commons.wikimedia.org/wiki/File:Runge_phenomenon.svg"><img width="400" alt="Runge phenomenon" src="https://upload.wikimedia.org/wikipedia/commons/thumb/0/0a/Runge_phenomenon.svg/512px-Runge_phenomenon.svg.png"></a></p> +<p style="text-align:center;"><a title="Nicoguaro, CC BY 4.0 &lt;https://creativecommons.org/licenses/by/4.0&gt;, via Wikimedia Commons" href="https://commons.wikimedia.org/wiki/File:Runge_phenomenon.svg"><img width="400" alt="Chart of the shape of a polynomial function, oscillating at the edges as the order of the function increases, demonstrating Runge's phenomenon" src="https://upload.wikimedia.org/wikipedia/commons/thumb/0/0a/Runge_phenomenon.svg/512px-Runge_phenomenon.svg.png"></a></p> ] +--- + +# Degrees of freedom (df) + +Within a model, how many 'parts' are free to vary? + +E.g. if we have 3 numbers and we know the average is 5: + +If we have 2 and 7, our final number is not free to vary. 
+It must be 6: +$$ \frac{2 + 7 + 6}{3} = 5$$ +This means our 'model' is constrained to `\(n-1\)` degrees of freedom. + + +-- + +## In regression context: + ++ The number of data points in our model ( `\(N\)` ) limits the df ++ Usually the number of predictors ( `\(k\)` ) in our model is considered the df (plus one for the intercept) ++ "Residual df" are the points left to vary in the model, after accounting for the predictors and intercept: + +`$$N-k-1$$` +Helpful post on [CrossValidated](https://stats.stackexchange.com/questions/340007/confused-about-residual-degree-of-freedom) + +--- + +# Overfitting + + > When our model fits both the underlying relationship and the 'noise' peculiar to our sample data + + + You want to fit the relationship, whilst minimising the noise. + + + This helps 'generalizability': meaning it will predict well on new data. + + +If we allowed total freedom in our model, e.g. a knot at every data point, what would happen? + +--- +class: middle + +# Exercise 1: Load and fit non-linear relationship +Here we will visualise the relationship, view it as a linear regression, and attempt a polynomial fit. + + --- # What if we could do something more suitable? 
@@ -174,7 +213,7 @@ .pull-right[ -<p style="text-align:center;"><a title="Pearson Scott Foresman, Public domain, via Wikimedia Commons" href="https://commons.wikimedia.org/wiki/File:Spline_(PSF).png"><img width="400" alt="Spline (PSF)" src="https://upload.wikimedia.org/wikipedia/commons/thumb/f/fd/Spline_%28PSF%29.png/512px-Spline_%28PSF%29.png"></a></p> +<p style="text-align:center;"><a title="Pearson Scott Foresman, Public domain, via Wikimedia Commons" href="https://commons.wikimedia.org/wiki/File:Spline_(PSF).png"><img width="400" alt="Drawing of a draftsman bending a flexible piece of wood and using it to draw a smooth curve." src="https://upload.wikimedia.org/wikipedia/commons/thumb/f/fd/Spline_%28PSF%29.png/512px-Spline_%28PSF%29.png"></a></p> ] @@ -187,6 +226,7 @@ <img src="GAMworkshop_files/figure-html/gam1-1.png" alt="Scatter plot from earlier with a wiggly sigmoidal best fit line" width="864" style="display: block; margin: auto;" /> + --- # How smooth? @@ -235,7 +275,6 @@ .smaller[ + `\(f_i\)` are smooth functions of the covariates, `\(xk\)`, where `\(k\)` is each function basis.] - --- # What does that mean for me? @@ -243,15 +282,33 @@ + _Hastie (1985)_ used knot every point, _Wood (2017)_ uses reduced-rank version - -- +## Issues + ++ We need to choose the right _dimension_ (degrees of freedom / knots) for our smoothers ++ We need to choose the right penalty (`\(\lambda\)`) for our smoothers + +### Consequence + ++ If you penalise a smooth of `\(k\)` dimensions, it no longer has `\(k-1\)` degrees of freedom, as the penalty reduces them ++ 'Effective degrees of freedom' - the penalized df of the predictors in the model. + +__Note:__ <br> +`\(df(\lambda) = k\)`, when `\(\lambda = 0\)` +<br> +`\(df(\lambda) \rightarrow 0\)`, when `\(\lambda \rightarrow \infty\)` + + +--- # mgcv: mixed gam computation vehicle + Prof. Simon Wood's package, pretty much the standard + Included in standard `R` distribution, used in `ggplot2` `geom_smooth` etc. 
++ Has sensible defaults for dimensions ++ Estimates the ideal penalty for smooths by various methods, with REML recommended. -- @@ -265,6 +322,7 @@ + `s()` controls smoothers (and other options, `t`, `ti`) + `bs="cr"` telling it to use cubic regression spline ('basis') + Default determined from data, but you can alter this e.g. (`k=10`) ++ Penalty (smoothing parameter) estimation method is set with `method="REML"` --- # Model Output: @@ -384,6 +442,37 @@ ## Yes, yes it is! +--- +class: middle + +# Exercise 2: Simple GAM fit +We will now fit the same relationship with a GAM using the `mgcv` package. + +--- +class: middle + +# Exercise 3: GAM fitting options +We will now look at varying the fit using things like the degrees of freedom and penalty. +We will also visualise these changes and their effects on models. + +--- +class: middle + +# Exercise 4: Multivariable GAMs +We will now generalise to include more than one predictor / smooth function, how they might be combined, and effects on models. We will also progress on to a generalised linear model, using a distribution family. + +--- +class: middle + +# Break + +--- +class: middle + +# Exercise 5: Put it all together! +We will now apply what we've learnt to the Framingham cohort study, predicting the binary variable 'TenYearCHD' using other columns. See how you can use smoothers on the continuous variables to get the best fit possible. +___Hint:___ you will need to use AIC or AUC to compare models, not `\(R^2\)`.
+ --- # Summary @@ -414,8 +503,7 @@ # References and Further reading: #### GitHub code: -https://github.com/chrismainey/fitting_wiggly_data - +https://github.com/chrismainey/GAMworkshop #### Simon Wood's comprehensive book: diff --git a/Solutions/exercise5_CHDRisk.R b/Scripts/Exercise5_CHDRisk_CM_answer.R similarity index 69% rename from Solutions/exercise5_CHDRisk.R rename to Scripts/Exercise5_CHDRisk_CM_answer.R index 727f5fd..e93b1bf 100644 --- a/Solutions/exercise5_CHDRisk.R +++ b/Scripts/Exercise5_CHDRisk_CM_answer.R @@ -61,8 +61,41 @@ CHDrisk2 <- gam(TenYearCHD ~ factor(male) +CHDrisk3 <- gam(TenYearCHD ~ factor(male) + + factor(currentSmoker) + + s(age) + + te(sysBP, diaBP) + + s(glucose, by = factor(diabetes)) + + factor(prevalentStroke) + + factor(prevalentHyp) + + s(totChol) + , data = framingham + , family = "binomial" + , method = "REML") + + +CHDrisk4 <- gam(TenYearCHD ~ factor(male) + + factor(currentSmoker) + + s(age, sysBP) + # + te(age, sysBP) + + s(sysBP) + + s(diaBP) + + s(glucose, by = factor(diabetes)) + + factor(prevalentStroke) + + factor(prevalentHyp) + + s(totChol) + , data = framingham + , family = "binomial" + , method = "REML") + + + + + gam.check(CHDrisk) gam.check(CHDrisk2) +gam.check(CHDrisk3) +gam.check(CHDrisk4) library(gratia) appraise(CHDrisk) @@ -71,7 +104,12 @@ gratia::draw(CHDrisk) ModelMetrics::auc(CHDrisk) summary(CHDrisk) +summary(CHDrisk2) +summary(CHDrisk3) +summary(CHDrisk4) # Change of 4 ~ asymptotic to 95% significant AIC(CHDrisk) AIC(CHDrisk2) +AIC(CHDrisk3) +AIC(CHDrisk4) diff --git a/Solutions/exercise1_osteo_sarcoma.R b/Scripts/exercise1_osteo_sarcoma_poly.R similarity index 83% rename from Solutions/exercise1_osteo_sarcoma.R rename to Scripts/exercise1_osteo_sarcoma_poly.R index d910f94..cb2f660 100644 --- a/Solutions/exercise1_osteo_sarcoma.R +++ b/Scripts/exercise1_osteo_sarcoma_poly.R @@ -20,7 +20,7 @@ sarcoma <- read_csv("./data/sarcoma.csv", name_repair = "universal") -######## Part 1: Linear regression with 
single predictor ########## +###################### Part 1: Linear regression with a single predictor ###################################### # With the Sarcoma data, we will examine the relationship between age and incidence # Pick either the Male or Female reporting category and lets consider the number of Cases @@ -36,11 +36,19 @@ View(sarcoma) library(ggplot2) ggplot(sarcoma, aes(x=Age, y=Male.Cases))+ - geom_col(fill="green2", col=1, alpha=0.5)+ + geom_col(fill="firebrick", col=1, alpha=0.5)+ labs(title="Incidence of Osteosarcoma in UK Males, 2016 - 2018", subtitle="Source: Cancer Research UK")+ theme_minimal()+ theme(plot.subtitle = element_text(face="italic")) +ggplot(sarcoma, aes(x=Age, y=Female.Cases))+ + geom_col(fill="orange2", col=1, alpha=0.5)+ + labs(title="Incidence of Osteosarcoma in UK Females, 2016 - 2018", subtitle="Source: Cancer Research UK")+ + theme_minimal()+ + theme(plot.subtitle = element_text(face="italic")) + + + # Does the number of cases appear related to age? - Some increasing relationship # If so, is it linearly? @@ -50,7 +58,13 @@ ggplot(sarcoma, aes(x=Age, y=Male.Cases))+ # Try the same with the rate to see if population size clouds it. 
ggplot(sarcoma, aes(x=Age, y=Male.Rates))+ - geom_col(fill="dodgerblue2", col=1, alpha=0.5)+ + geom_col(fill="mediumpurple4", col=1, alpha=0.5)+ + labs(title="Incidence Rate of Osteosarcoma in UK Males, 2016 - 2018", subtitle="Source: Cancer Research UK")+ + theme_minimal()+ + theme(plot.subtitle = element_text(face="italic")) + +ggplot(sarcoma, aes(x=Age, y=Female.Rates))+ + geom_col(fill="mediumaquamarine", col=1, alpha=0.5)+ labs(title="Incidence Rate of Osteosarcoma in UK Males, 2016 - 2018", subtitle="Source: Cancer Research UK")+ theme_minimal()+ theme(plot.subtitle = element_text(face="italic")) @@ -64,7 +78,7 @@ ggplot(sarcoma, aes(x=Age, y=Male.Rates))+ -################################################# +################################################################################################################ # Modelling the rate in Males. # Build a linear regression model (lm) to examine it using the summary function @@ -77,7 +91,7 @@ summary(sarcoma_lm) # Let's visualise it using ggplots 'geom_smooth', giving it the method = lm ggplot(sarcoma, aes(x=Age, y=Male.Rates))+ - geom_col(fill="dodgerblue2", col=1, alpha=0.5)+ + geom_col(fill="mediumpurple4", col=1, alpha=0.5)+ geom_smooth(col="red", method = "lm") + labs(title="Incidence Rate of Osteosarcoma in UK Males, 2016 - 2018", subtitle="Source: Cancer Research UK")+ theme_minimal()+ @@ -87,7 +101,8 @@ ggplot(sarcoma, aes(x=Age, y=Male.Rates))+ # Examine this plot. Is the model a good fit? # Is it good for any of the range? -################################################# + +################################################################################################################# # Let's try and do this with a polynomial term or two: Age + Age^2 + Age^3 # You can manually do it, or use the poly() function, which helps reduce correlation @@ -101,7 +116,7 @@ summary(sarcoma_lm_poly) # Visualise this again with ggplot geom_smooth. 
Add the argument: formula = "y ~ naturalSpline(x)", including your options ggplot(sarcoma, aes(x=Age, y=Male.Rates))+ - geom_col(fill="dodgerblue2", col=1, alpha=0.5)+ + geom_col(fill="mediumpurple4", col=1, alpha=0.5)+ geom_smooth(col="red", method = "lm", formula = "y ~ poly(x,degree=4)") + labs(title="Incidence Rate of Osteosarcoma in UK Males, 2016 - 2018", subtitle="Source: Cancer Research UK")+ theme_minimal()+ @@ -139,7 +154,7 @@ summary(sarcoma_lm_spline2) # Visualise this again with ggplot geom_smooth. Add the argument: formula = "y ~ naturalSpline(x)", including your options ggplot(sarcoma, aes(x=Age, y=Male.Rates))+ - geom_col(fill="dodgerblue2", col=1, alpha=0.5)+ + geom_col(fill="mediumpurple4", col=1, alpha=0.5)+ geom_smooth(col="red", method = "lm", formula = "y ~ naturalSpline(x, df=4)") + labs(title="Incidence Rate of Osteosarcoma in UK Males, 2016 - 2018", subtitle="Source: Cancer Research UK")+ theme_minimal()+ @@ -147,7 +162,7 @@ ggplot(sarcoma, aes(x=Age, y=Male.Rates))+ ggplot(sarcoma, aes(x=Age, y=Male.Rates))+ - geom_col(fill="dodgerblue2", col=1, alpha=0.5)+ + geom_col(fill="mediumpurple4", col=1, alpha=0.5)+ geom_smooth(col="red", method = "lm", formula = "y ~ naturalSpline(x, knots= c(24,49,79))") + labs(title="Incidence Rate of Osteosarcoma in UK Males, 2016 - 2018", subtitle="Source: Cancer Research UK")+ theme_minimal()+ diff --git a/Scripts/exercise2_osteo_sarcoma_GAM.R b/Scripts/exercise2_osteo_sarcoma_GAM.R new file mode 100644 index 0000000..e10663a --- /dev/null +++ b/Scripts/exercise2_osteo_sarcoma_GAM.R @@ -0,0 +1,87 @@ +# Here is an example using Osteosarcoma, a type of bone cancer. +# This is based on incidence data, per 100,000 population, 2016-2018, +# Published by Cancer Research UK: +# https://www.cancerresearchuk.org/health-professional/cancer-statistics/statistics-by-cancer-type/bone-sarcoma/incidence +# The data have been slightly regrouped to aid the exercise. 
+ +# Load the data: + +library(readr) +library(ggplot2) +sarcoma <- read_csv("./data/sarcoma.csv", name_repair = "universal") + +# Columns in the data are: +# Age: Upper number of 5-year age bands. e.g. 0 - 4, is coded as 4. 90+ is coded as 95 +# for the sake of this exercise. +# Female.Cases: Number of incident cases reported in females. +# Male.Cases: Number of incident cases reported in males. +# Female.Rates: Incidence Rate per 100,000 population, in this age group, for females +# Male.Rates: Incidence Rate per 100,000 population, in this age group, for males + + + + +############################################################ + +# Now let's fit it as a GAM using the mgcv package. +# Use the 's()' function to wrap a smoother around age. We'll let it choose for us at first. + +library(mgcv) +sarcoma_gam1 <- gam(Male.Rates ~ s(Age), data=sarcoma) + +summary(sarcoma_gam1) + + +# Plot the results +plot(sarcoma_gam1, residuals = TRUE, pch = 1) + + +#### Can also use alternatives like the `gratia` or `mgcViz` packages: + +# https://gavinsimpson.github.io/gratia/index.html +gratia::draw(sarcoma_gam1, residuals = TRUE) + +library(mgcViz) +# https://mfasiolo.github.io/mgcViz/articles/mgcviz.html +b <- mgcViz::getViz(sarcoma_gam1) + +o <- plot( sm(b, 1) ) +o + l_fitLine(colour = "red") + l_rug(mapping = aes(x=x, y=y), alpha = 0.8) + + l_ciLine(mul = 5, colour = "blue", linetype = 2) + + l_points(shape = 19, size = 1, alpha = 0.1) + theme_classic() + + + +# Extract the model coefficients with the `coef` function. How many coefficients are there for basis functions that make up this smooth? +coef(sarcoma_gam1) + + +# Can you interpret any of those coefficients? +# How do you know if they are 'significant'? + +summary(sarcoma_gam1) +anova(sarcoma_gam1) + +# Both show the pooled significance of the whole smoother. 
+ +# Predict and fit manually: +# We'll create a column called 'gam1_preds' for the predictions in our original data frame + +sarcoma$gam1_preds <- predict(sarcoma_gam1, newdata = sarcoma) + +# Also compare to predictions from our original models + +sarcoma$lm_preds <- predict(sarcoma_lm, newdata = sarcoma) +sarcoma$lm_poly_preds <- predict(sarcoma_lm_poly, newdata = sarcoma) + +ggplot(sarcoma, aes(y=Male.Rates))+ + geom_point(aes(x=gam1_preds), col="mediumpurple")+ + geom_point(aes(x=lm_preds), col="mediumaquamarine")+ + geom_point(aes(x=lm_poly_preds), col="orange2")+ + geom_abline(intercept=0, slope=1, col="red") + +# In general, purple points (gam) are closer to the line (perfect prediction) compared to the regular lm. +# Let's sum up the residual error, squaring it so +/- don't cancel. +sum(residuals(sarcoma_lm)^2) +sum(residuals(sarcoma_lm_poly)^2) +sum(residuals(sarcoma_gam1)^2) diff --git a/Solutions/exercise2_osteo_sarcoma.R b/Scripts/exercise3_osteo_sarcoma_gam_options.R similarity index 96% rename from Solutions/exercise2_osteo_sarcoma.R rename to Scripts/exercise3_osteo_sarcoma_gam_options.R index fecec40..b74b8b8 100644 --- a/Solutions/exercise2_osteo_sarcoma.R +++ b/Scripts/exercise3_osteo_sarcoma_gam_options.R @@ -85,7 +85,9 @@ plot(sarcoma_gam8, residuals = TRUE, pch = 1) plot(sarcoma_gam9, residuals = TRUE, pch = 1) # The higher penalties reduce the 'wiggliness' - +# Extreme: +sarcoma_gam7a <- gam(Male.Rates ~ s(Age, bs="cr", k=12), data=sarcoma, sp=1e12) +plot(sarcoma_gam7a, residuals = TRUE, pch = 1) ############################################################ # Model fitting and diagnostics @@ -119,3 +121,5 @@ check.gamViz(getViz(sarcoma_gam10)) # mgcv has functions already for some of diagnostics qq.gam(sarcoma_gam10, pch=1) + + diff --git a/Solutions/exercise3_sarcoma_and_los.R b/Scripts/exercise4_sarcoma_and_los_multiplevariable.R similarity index 99% rename from Solutions/exercise3_sarcoma_and_los.R rename to 
Scripts/exercise4_sarcoma_and_los_multiplevariable.R index b6d66ee..4f4ba73 100644 --- a/Solutions/exercise3_sarcoma_and_los.R +++ b/Scripts/exercise4_sarcoma_and_los_multiplevariable.R @@ -176,4 +176,6 @@ concurvity(gam_death2, full = TRUE) concurvity(gam_death2, full = FALSE) +# Is there a problem with concurvity here? + diff --git a/Scripts/exercise5_CHDRisk.R b/Scripts/exercise5_CHDRisk.R new file mode 100644 index 0000000..7aca176 --- /dev/null +++ b/Scripts/exercise5_CHDRisk.R @@ -0,0 +1,39 @@ +### Examining Framingham data + +# The Framingham Heart Study is a long-standing cohort study that is responsible for +# much of what we know about heart diseases and associated conditions. For more info, see: +# https://en.wikipedia.org/wiki/Framingham_Heart_Study + +# We are going to use GAMs to investigate some of it: + +framingham <- read.csv("./data/framingham.csv") + +# The columns we have are: +# male 0 = Female; 1 = Male +# age Age at examination time. +# education 1 = Some High School; 2 = High School or GED; +# 3 = Some College or Vocational School; 4 = college +# currentSmoker 0 = nonsmoker; 1 = smoker +# cigsPerDay number of cigarettes smoked per day (estimated average) +# BPMeds 0 = Not on Blood Pressure medications; 1 = Is on Blood Pressure medications +# prevalentStroke +# prevalentHyp +# diabetes 0 = No; 1 = Yes +# totChol mg/dL +# sysBP mmHg +# diaBP mmHg +# BMI Body Mass Index calculated as: Weight (kg) / Height(meter-squared) +# heartRate Beats/Min (Ventricular) +# glucose mg/dL +# TenYearCHD 0 = No CHD at 10-year follow-up; 1 = CHD at 10-year follow-up + + +# Try and build the best regression model you can: + +library(mgcv) + +CHDrisk <- gam(TenYearCHD ~ # enter your model terms here + , data = framingham + , family = "binomial" + , method = "REML") + diff --git a/xaringan-themer.css b/xaringan-themer.css index 44d149c..fa0c614 100644 --- a/xaringan-themer.css +++ b/xaringan-themer.css @@ -159,7 +159,9 @@ img, video, iframe { } blockquote { 
border-left: solid 5px #23395b80; + background-color: #e0e0e0; padding-left: 1em; + padding: 0.05rem 0.05rem 0.05rem 1em; } .remark-slide table { margin: auto;