---
title: "Benchmark Sparsity Threshold Tidymodels"
format: gfm
---
## How to run simulation
1. Install the {renv} package with `pak::pak("renv")` and call `renv::restore()`. This will install the correct versions of the packages.
2. Run `source("make_files.R")` to set up the simulation files.
3. Call `make` from the terminal while in the `files/` directory.
4. Run `source("collect_results.R")` to create the results data set.
5. Run `quarto render readme.qmd` to update the readme with the new results.
## Results
Loading packages and data.
```{r}
#| message: false
library(tidyverse)
eval_data <- read_rds("simulation_results.rds") |>
  mutate(sparse_data = if_else(sparse_data, "sparse", "dense"))
```
`eval_data` data dictionary:

- `sparse_data`: Logical, whether the data was encoded as a sparse tibble or not. Recoded above to `"sparse"` / `"dense"` for plotting.
- `model`: Character, which parsnip model was used.
- `n_numeric`: Numeric, number of numeric columns. These columns are dense, meaning they contain few or no 0 values.
- `n_counts`: Numeric, number of count columns. These columns are sparse, meaning almost all values are 0.
- `n_rows`: Numeric, number of rows in the data set.
- `seed`: Numeric, seed value.
- `time`: Numeric, number of seconds it took to run `fit(wf_spec, data)`.
- `mem_alloc`: Numeric, amount of memory allocated when running `fit(wf_spec, data)`.
- `rmse`: Numeric, root mean squared error between predictions and true values.
- `sparsity`: Numeric, proportion of 0s in the data (shown as a percentage in the plots below).
Run-time by sparsity and encoding
```{r}
#| label: "sparsity-vs-time"
eval_data |>
  ggplot(aes(sparsity, time, color = sparse_data)) +
  geom_point(alpha = 0.25) +
  theme_minimal() +
  scale_x_continuous(labels = scales::percent) +
  labs(
    x = "sparsity (percentage of 0s)",
    y = "time (seconds)",
    color = "encoding"
  )
```
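The `sparsity` axis above is labeled as the percentage of 0s. As a rough illustration of what that quantity means for a single data set, the sketch below computes the proportion of zero cells in a small toy data frame. The helper `prop_zero()` and the toy data are hypothetical and not part of the simulation scripts, which may compute `sparsity` differently.

```r
# Hypothetical helper (not from the simulation code):
# proportion of cells in an all-numeric data frame that are equal to 0.
prop_zero <- function(data) {
  mean(as.matrix(data) == 0)
}

# Toy data: one dense numeric column and two sparse count columns.
toy <- data.frame(
  dense_1 = c(1.2, 3.4, 0.5, 2.2),
  count_1 = c(0, 0, 1, 0),
  count_2 = c(0, 2, 0, 0)
)

# 6 of the 12 cells are 0.
toy_sparsity <- prop_zero(toy)
toy_sparsity
#> [1] 0.5
```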
Memory allocation by sparsity and encoding
```{r}
#| label: "sparsity-vs-mem_alloc"
eval_data |>
  ggplot(aes(sparsity, mem_alloc, color = sparse_data)) +
  geom_point(alpha = 0.25) +
  theme_minimal() +
  scale_x_continuous(labels = scales::percent) +
  bench::scale_y_bench_bytes() +
  labs(
    x = "sparsity (percentage of 0s)",
    y = "Memory Allocation",
    color = "encoding"
  )
```
Each model predicts on the training data, and the RMSE is calculated with `yardstick::rmse()`. This value is compared between the sparse and dense encodings of the data to detect differences in model fits.
```{r}
#| label: "rmse-dense-vs-sparse"
rmse_tbl <- eval_data |>
  select(sparse_data, model, n_numeric, n_counts, n_rows, seed, rmse) |>
  pivot_wider(values_from = rmse, names_from = sparse_data,
              names_prefix = "rmse_") |>
  mutate(rmse_diff = rmse_sparse - rmse_dense, .before = everything())

rmse_tbl |>
  ggplot(aes(rmse_sparse, rmse_dense)) +
  geom_point() +
  theme_minimal() +
  labs(
    title = "RMSE of Model with ___ encoding",
    x = "sparse",
    y = "dense"
  )
```
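For reference, the RMSE compared above is the square root of the mean squared difference between predictions and true values. A minimal base R sketch on a built-in data set, with `lm()` standing in for the parsnip models and a hypothetical `rmse_vec_base()` helper standing in for `yardstick::rmse()`:

```r
# Hypothetical helper: root mean squared error between truth and estimate.
rmse_vec_base <- function(truth, estimate) {
  sqrt(mean((truth - estimate)^2))
}

# Fit a model and predict on the same (training) data.
# lm() and mtcars are stand-ins, not the models or data used in the simulation.
fit <- lm(mpg ~ wt + hp, data = mtcars)
preds <- predict(fit, newdata = mtcars)

train_rmse <- rmse_vec_base(mtcars$mpg, preds)
train_rmse
```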
There are some runs where the RMSE does not match between the two encodings. The proportion of such runs:
```{r}
#| label: "rmse_diff-count"
rmse_tbl |>
  summarise(nonzero = sum(rmse_diff != 0) / n())
```
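Note that `rmse_diff != 0` is an exact comparison, so differences at the level of floating-point noise count as mismatches too. A minimal sketch, on made-up values rather than the simulation results, of how the proportion can change when a tolerance is used. The tolerance `1e-8` is an arbitrary assumption.

```r
# Made-up differences: one exact match, one floating-point artifact,
# and two larger differences.
rmse_diff <- c(0, (0.1 + 0.2) - 0.3, 0.015, -0.2)

# Exact comparison, equivalent to sum(rmse_diff != 0) / n().
prop_exact <- mean(rmse_diff != 0)

# Comparison with an assumed tolerance.
tol <- 1e-8
prop_tol <- mean(abs(rmse_diff) > tol)

c(exact = prop_exact, tolerance = prop_tol)
#>     exact tolerance 
#>      0.75      0.50 
```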
The mismatches happen for these models:
```{r}
#| label: "rmse_diff-nonzero-counts"
rmse_tbl |>
  filter(rmse_diff != 0) |>
  count(model)
```
The distribution of the non-zero differences:
```{r}
#| label: "rmse_diff-nonzero-plot"
rmse_tbl |>
  filter(rmse_diff != 0) |>
  ggplot(aes(rmse_diff)) +
  geom_histogram()
```