# Spending our data
**Learning objectives:**
- Use {rsample} to **split data into training and testing sets.**
- Identify cases where **stratified sampling** is useful.
- Understand the **difference** between `rsample::initial_time_split()` and `rsample::initial_split()`.
- Understand the **trade-offs** between too little training data and too little testing data.
- Define a **validation set** of data.
- Explain why data should be split at the **independent experimental unit** level.
## Spending our data
The task of creating a useful model can be daunting. Thankfully, it can be broken down into steps. It helps to sketch out your path, as Chanin Nantasenamat has done:
![](images/step_by_step_ml.jpg)
We're going to zoom in on the data splitting part. As the diagram shows, it is one of the earliest considerations in a model building workflow. The **training set** is the data that the model(s) learn from. It's usually the majority of the data (typically 70-80%), and you'll spend the bulk of your time fitting models to it.
The **test set** is the data set aside for an unbiased assessment once one or more candidate models have been chosen. Unlike the training set, the test set is only looked at once.
Why is it important to think about data splitting? You could do everything right, from cleaning the data to engineering features to picking a great model, and still get bad results when the model is tested on data it hasn't seen before. If you're in this predicament, the data splitting you've employed may be worth further investigation.
## Common methods for splitting data
Choosing how to split the data into training and test sets is not always trivial; it depends on the data and the modeling purpose.
The most common approach is simple random sampling, which is easy to do in R with the [rsample](https://rsample.tidymodels.org) package's `initial_split()` function. For the [Ames housing dataset](https://www.tmwr.org/ames.html), the call would be:
```{r setup-05, message=FALSE}
library(tidymodels)
set.seed(123)
data(ames)
ames_split <- initial_split(ames, prop = 0.80)
ames_split
```
The object `ames_split` is an `rsplit` object. To get the training and test data you can call `training()` and `testing()`:
```{r train-test-05}
ames_train <- training(ames_split)
ames_test <- testing(ames_split)
```
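As a quick sanity check, we can look at the resulting row proportions; they should reflect the roughly 80/20 split:
```{r check-split-05}
# Roughly 80% of rows should end up in training and 20% in testing
nrow(ames_train) / nrow(ames)
nrow(ames_test) / nrow(ames)
```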
## Class imbalance
In many instances, random splitting is not suitable. This includes datasets with *class imbalance*, where one class greatly outnumbers the other(s). Class imbalance is important to detect and to take into account when splitting the data. A purely random split on a dataset with severe class imbalance can allocate the minority class disproportionately to the training or test set, which can make the model look better or worse at validation than it really is. The goal is to have the same class distribution in both the training and test sets. Class imbalance can occur in differing degrees:
![](images/class_imbalance.png)
Splitting methods suited for datasets containing class imbalance should be considered. Let's consider a #Tidytuesday dataset on [Himalayan expedition members](https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-09-22/readme.md), which Julia Silge recently explored [here](https://juliasilge.com/blog/himalayan-climbing/) using **{tidymodels}**.
```{r skim-05, message=FALSE}
library(tidyverse)
library(skimr)
members <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-22/members.csv")
skim(members)
```
Let's say we were interested in predicting the likelihood of survival or death for an expedition member. It would be a good idea to check for class imbalance:
```{r janitor-05, message=FALSE}
library(janitor)
members %>%
tabyl(died) %>%
adorn_totals("row")
```
We can see that nearly 99% of people survive their expedition. This dataset would be ripe for a sampling technique adept at handling such extreme class imbalance. This technique is called *stratified sampling*, in which "the training/test split is conducted separately within each class and then these subsamples are combined into the overall training and test set". Operationally, this is done by using the `strata` argument inside `initial_split()`:
```{r more-split-05}
set.seed(123)
members_split <- initial_split(members, prop = 0.80, strata = died)
members_train <- training(members_split)
members_test <- testing(members_split)
```
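To confirm that stratification kept the class distribution similar across the splits, one option is to tabulate `died` in each set:
```{r check-strata-05}
# The death rate should be nearly identical in the two splits
members_train %>% tabyl(died)
members_test %>% tabyl(died)
```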
### Stratified sampling simulation
With simulation we can see the effect of stratification: we expect the expected value of the test-set class proportion to be unchanged by stratification, while its variance across repeated splits is lower.
```{r simulate-stratified}
simulate_stratified_sampling <- function(prop_in_dataset, n_resample = 50,
                                         n_rows = 1000, seed = 45678) {
  set.seed(seed)
  # Build a dataset with the requested proportion of deaths
  data_to_split <- tibble(died = c(
    rep(1, floor(n_rows * prop_in_dataset)),
    rep(0, floor(n_rows * (1 - prop_in_dataset)))
  ))
  # Repeatedly split without stratification and record the test-set death rate
  samples <- map_dfr(seq_len(n_resample), ~ {
    initial_split(data_to_split) %>%
      testing() %>%
      summarize(died_pct = mean(died))
  })
  # Same, but stratified on the outcome
  stratified_samples <- map_dfr(seq_len(n_resample), ~ {
    initial_split(data_to_split, strata = died) %>%
      testing() %>%
      summarize(died_pct = mean(died))
  })
  # Compare the mean and variance of the test-set death rate
  rbind(
    samples %>% mutate(stratified = FALSE),
    stratified_samples %>% mutate(stratified = TRUE)
  ) %>%
    group_by(stratified) %>%
    summarize(mean = mean(died_pct), var = var(died_pct))
}
```
By default, rsample pools strata that contain [less than 10% of the data](https://rsample.tidymodels.org/reference/initial_split.html#arguments), so stratification has no effect when the minority class is below that threshold:
```{r stratified-b}
simulate_stratified_sampling(0.09)
```
Above that threshold, stratification takes effect:
```{r stratified-c}
simulate_stratified_sampling(0.11)
```
## Continuous outcome data
For continuous outcome data (e.g. costs), stratified random sampling conducts the 80/20 split within each quartile of the outcome and then pools the results. For the [Ames housing dataset](https://www.tmwr.org/ames.html), the call would look like this:
```{r more-split-b-05}
set.seed(123)
ames_split <- initial_split(ames, prop = 0.80, strata = Sale_Price)
ames_train <- training(ames_split)
ames_test <- testing(ames_split)
```
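One way to see the effect is to compare the `Sale_Price` quartiles in the two splits; with stratification they should be very close:
```{r check-strata-ames-05}
# Sale_Price quartiles should be similar across the training and test sets
quantile(ames_train$Sale_Price, probs = c(0.25, 0.5, 0.75))
quantile(ames_test$Sale_Price, probs = c(0.25, 0.5, 0.75))
```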
## Time series data
For time series data, where you'd want to allocate data to the training and test sets based on a sorted order, you can use `initial_time_split()`, which works similarly to `initial_split()`. The `prop` argument specifies what proportion of the first part of the data should be used as the training set.
```{r time-train-test-05}
data(drinks)
drinks_split <- initial_time_split(drinks)
train_data <- training(drinks_split)
test_data <- testing(drinks_split)
```
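As a quick check of the default behaviour (`prop = 3/4`), the first three quarters of the rows go to training and the training period ends before the test period begins:
```{r check-time-split-05}
# By default the first 3/4 of the rows form the training set
nrow(train_data) / nrow(drinks)
c(max(train_data$date), min(test_data$date))
```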
The `lag` argument can specify a lag period to use between the training and test set. This is useful if lagged predictors will be used during training and testing.
```{r lag-train-test-05}
drinks_lag_split <- initial_time_split(drinks, lag = 12)
train_data_lag <- training(drinks_lag_split)
test_data_lag <- testing(drinks_lag_split)
c(max(train_data_lag$date), min(test_data_lag$date))
```
## Multi-level data
It's important to figure out what the **independent experimental unit** is in your data. In the Ames dataset, there is one row per house and so houses and their properties are considered to be independent of one another.
In other datasets, there may be multiple rows per experimental unit (e.g. patients measured repeatedly over time). This has implications for data splitting. To avoid data from the same experimental unit appearing in both the training and test sets, split along the independent experimental units so that X% of the experimental units end up in the training set.
Example:
```{r heights-05}
# data source: http://www.bristol.ac.uk/cmm/learning/mmsoftware/data-rev.html#oxboys
child_heights <- read_delim(here::here("data/Oxboys.txt"), col_names = FALSE, delim = " ") %>%
purrr::set_names(c("child_id", "age_norm", "height", "measurement_id", "season"))
head(child_heights)
```
Depending on the modeling problem, we may want to split the data into training and test sets so that all data points for a given child stay together. You can do this with the following code.
```{r heights-split-05}
child_heights_split <- child_heights %>%
group_nest(child_id) %>%
initial_split()
child_heights_train <- training(child_heights_split) %>%
unnest(data)
```
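We can check that no child ends up in both sets:
```{r heights-check-05}
child_heights_test <- testing(child_heights_split) %>%
  unnest(data)

# No child_id should be shared between the training and test sets
intersect(
  unique(child_heights_train$child_id),
  unique(child_heights_test$child_id)
)
```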
For other tasks it might be more suitable to split along the measurement ID, so that each child's last measurement forms the test set; a sketch of this follows below.
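A minimal sketch of that idea, assuming `measurement_id` increases within each child (the object names below are our own):
```{r heights-last-measurement-05}
# Each child's last measurement becomes the test set;
# all earlier measurements form the training set
child_heights_last <- child_heights %>%
  group_by(child_id) %>%
  slice_max(measurement_id, n = 1) %>%
  ungroup()

child_heights_earlier <- child_heights %>%
  anti_join(child_heights_last, by = c("child_id", "measurement_id"))
```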
## What proportion should be used?
https://twitter.com/asmae_toumi/status/1356024351720689669?s=20
![](images/provocative_8020.jpg)
Some people said the 80/20 split comes from the [Pareto principle/distribution](https://en.wikipedia.org/wiki/Pareto_principle) or the [power law](https://en.wikipedia.org/wiki/Power_law). Some said because it works nicely with 5-fold cross-validation (which we will see in the later chapters).
![](images/train_test.png)
I believe the point is to put enough data in the training set for solid parameter estimation, while keeping enough in the test set for a reliable estimate of performance. An 80/20 or 70/30 split seems reasonable for most problems, and it's what is widely used. Max Kuhn notes that a test set is almost always a good idea, and it should only be avoided when the data is "pathologically small".
## Summary
Data splitting is an important part of a modeling workflow as it impacts model validity and performance. The most common splitting technique is random splitting. Some data call for different techniques: class-imbalanced or continuous outcomes benefit from stratified sampling, time series from time-ordered splits, and multi-level data from splits along the independent experimental unit. The `rsample` package contains functions for all of these.
We will learn more about how to remedy certain issues such as class imbalance, bias and overfitting in Chapter 10.
### References
- Tidy modeling with R by Max Kuhn and Julia Silge: https://www.tmwr.org/splitting.html
- Feature Engineering and Selection: A Practical Approach for Predictive Models by Max Kuhn and Kjell Johnson: https://bookdown.org/max/FES/
- Handle class imbalance in #TidyTuesday climbing expedition data with tidymodels: https://juliasilge.com/blog/himalayan-climbing/
- Data preparation and feature engineering for machine learning: https://developers.google.com/machine-learning/data-prep/construct/sampling-splitting/imbalanced-data
- How to Build a Machine Learning Model by Chanin Nantasenamat: https://towardsdatascience.com/how-to-build-a-machine-learning-model-439ab8fb3fb1
## Meeting videos
### Cohort 1
`r knitr::include_url("https://www.youtube.com/embed/_E_5pBFtSJk")`
<details>
<summary> Meeting chat </summary>
```
00:46:44 Esmeralda: Book: https://www.tmwr.org
00:46:58 Esmeralda: Website (notes about the book): https://r4ds.github.io/bookclub-tmwr_es/
```
</details>
`r knitr::include_url("https://www.youtube.com/embed/0FFbfC1t0RE")`
`r knitr::include_url("https://www.youtube.com/embed/bwx4AILEEaM")`
<details>
<summary> Meeting chat </summary>
```
00:05:00 Esme: I had forgotten that this week we in UK are an hour ahead Mexico time due to the time change.
```
</details>