-
Notifications
You must be signed in to change notification settings - Fork 1
/
README.Rmd
157 lines (112 loc) · 4.49 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
---
output: github_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%"
)
```
# tidy.outliers
<!-- badges: start -->
[![Lifecycle: experimental](https://img.shields.io/badge/lifecycle-experimental-orange.svg)](https://www.tidyverse.org/lifecycle/#experimental) [![CRAN status](https://www.r-pkg.org/badges/version/tidy.outliers)](https://CRAN.R-project.org/package=tidy.outliers) [![Codecov test coverage](https://codecov.io/gh/brunocarlin/tidy.outliers/branch/main/graph/badge.svg)](https://codecov.io/gh/brunocarlin/tidy.outliers?branch=master) [![R-CMD-check](https://github.com/brunocarlin/tidy.outliers/workflows/R-CMD-check/badge.svg)](https://github.com/brunocarlin/tidy.outliers/actions)
<!-- badges: end -->
The goal of tidy.outliers is to allow for easy usage of many outliers removal methods, currently implemented are:
Simple methods:
- Univariate based function
- Mahalanobis distance
Model Methods:
- [lookout](https://github.com/Sevvandi/lookout)
- [h2o.extendedIsolationForest](https://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/eif.html)
- [outForest](https://github.com/mayer79/outForest)
# What are outlier scores?
The package works on the principal that all basic step_outlier\_\* functions return an outlier "score" that can be used for filtering outliers where 0 is a very low outlier score and 1 is a very high outlier score, so you could filter, for example all rows where the outlier score is greater than .9.
# Installation
You can not yet install the released version of tidy.outliers from [CRAN](https://CRAN.R-project.org) with:
``` r
#install.packages("tidy.outliers")
```
And the development version from GitHub with:
```{r eval=FALSE}
# install.packages("devtools")
devtools::install_github("brunocarlin/tidy.outliers")
```
# Usage
## Load Libraries
```{r libraries, echo=TRUE, warning=FALSE,message=FALSE}
library(recipes)
library(tidy.outliers)
```
## Create a recipe for calculation the outlier scores
I keep the mpg as an example outcome since you should remove outlier from your outcome, you also shouldn't remove outlier from testing data so the default is to skip the steps of the package when predicting.
```{r recipe}
rec_obj <-
recipe(mpg ~ ., data = mtcars) |>
step_outliers_maha(all_numeric(), -all_outcomes()) |>
step_outliers_lookout(all_numeric(),-contains(r"(.outliers)"),-all_outcomes()) |>
prep(mtcars)
```
## Return scores
```{r bake_null}
bake(rec_obj,new_data = NULL) |>
select(contains(r"(.outliers)")) |>
arrange(.outliers_lookout |> desc())
```
# Example filtering based on scores
## Create recipe filtering outliers
```{r recipe remove}
rec_obj2 <-
recipe(mpg ~ ., data = mtcars) |>
step_outliers_maha(all_numeric(), -all_outcomes()) |>
step_outliers_lookout(all_numeric(),-contains(r"(.outliers)"),-all_outcomes()) |>
step_outliers_remove(contains(r"(.outliers)")) |>
prep(mtcars)
```
## Returns the filtered rows
We filtered one row from the dataset and and automatically removed the extra outlier columns.
```{r bake_null remove}
bake(rec_obj2,new_data = NULL) |> glimpse()
```
```{r}
mtcars |> glimpse()
```
## Investigate why some rows were filtered
And we can get which were the outliers and their score
```{r tidy remove}
tidy(rec_obj2,number = 3) |>
arrange(aggregation_results |> desc())
```
# Integration with tidymodels
The package was made to play nice with tune and friends from tidymodels check out the article on our [github pkgdown page](https://brunocarlin.github.io/tidy.outliers/articles/integration_tidymodels.html)!
# Next steps
Although it is possible to manually change the function of model parameters using the options argument it would be nice to add the option to tune those internal parameters as well.
So instead of this.
```{r}
rec_obj2 <-
recipe(mpg ~ ., data = mtcars) |>
step_outliers_outForest(
all_numeric(),
-all_outcomes(),
options = list(
impute_multivariate_control = list(
num.trees = 200
)
))
```
You would write something like this
```{r}
rec_obj2 <-
recipe(mpg ~ ., data = mtcars) |>
step_outliers_outForest(
all_numeric(),
-all_outcomes(),
options = list(
impute_multivariate_control = list(
num.trees = tune::tune('tree')
)
))
```
The main problem is that this would require manually going model by model and incorporating those arguments as tunable components.