Feature LOF outlier detector #746
@@ -0,0 +1,217 @@
from typing import Callable, Union, Optional, Dict, Any, List, Tuple
from typing import TYPE_CHECKING
from typing_extensions import Literal

import numpy as np

from alibi_detect.base import outlier_prediction_dict
from alibi_detect.exceptions import _catch_error as catch_error
from alibi_detect.od.base import TransformProtocol, TransformProtocolType
from alibi_detect.base import BaseDetector, FitMixin, ThresholdMixin
from alibi_detect.od.pytorch import LOFTorch, Ensembler
from alibi_detect.od.base import get_aggregator, get_normalizer, NormalizerLiterals, AggregatorLiterals
from alibi_detect.utils.frameworks import BackendValidator
from alibi_detect.version import __version__


if TYPE_CHECKING:
    import torch


backends = {
    'pytorch': (LOFTorch, Ensembler)
}

class LOF(BaseDetector, FitMixin, ThresholdMixin):
    def __init__(
        self,
        k: Union[int, np.ndarray, List[int], Tuple[int]],
        kernel: Optional[Callable] = None,
        normalizer: Optional[Union[TransformProtocolType, NormalizerLiterals]] = 'PValNormalizer',
        aggregator: Union[TransformProtocol, AggregatorLiterals] = 'AverageAggregator',
        backend: Literal['pytorch'] = 'pytorch',
        device: Optional[Union[Literal['cuda', 'gpu', 'cpu'], 'torch.device']] = None,
    ) -> None:
        """
        Local Outlier Factor (LOF) outlier detector.

        The LOF detector is a non-parametric method for outlier detection. It computes the local density
        deviation of a given data point with respect to its neighbors and considers as outliers the
        samples that have a substantially lower density than their neighbors.

        The detector can be initialized with `k` as a single value or an array of values. If `k` is a
        single value the score method uses the distance/kernel similarity to the k-th nearest neighbor.
        If `k` is an array of values the score method uses the distance/kernel similarity to each of the
        specified `k` neighbors. In the latter case, an `aggregator` must be specified to aggregate the
        scores.

        Note that, in the multiple-`k` case, a normalizer can be provided. If a normalizer is passed it
        is fit in the `infer_threshold` method, so `infer_threshold` must be called before `predict`;
        if it is not, an exception is raised. If `k` is a single value the `predict` method can be
        called without first calling `infer_threshold`, but only scores will be returned, not outlier
        predictions.

        Parameters
        ----------
        k
            Number of nearest neighbors to compute the distance to. `k` can be a single value or an
            array of integers. If an array is passed, an aggregator is required to aggregate the
            scores. If `k` is a single value we compute the local outlier factor for that `k`.
            Otherwise, if `k` is a list, we compute and aggregate the local outlier factor for each
            value in `k`.
        kernel
            Kernel function to use for outlier detection. If ``None``, `torch.cdist` is used.
            Otherwise, if a kernel is specified, the kernel defines the k nearest neighbor distance
            instead of `torch.cdist`.
        normalizer
            Normalizer to use for outlier detection. If ``None``, no normalization is applied. For a
            list of available normalizers, see :mod:`alibi_detect.od.pytorch.ensemble`.
        aggregator
            Aggregator to use for outlier detection. Can be set to ``None`` if `k` is a single value.
            For a list of available aggregators, see :mod:`alibi_detect.od.pytorch.ensemble`.
        backend
            Backend used for outlier detection. Defaults to ``'pytorch'``. Options are ``'pytorch'``.
        device
            Device type used. The default tries to use the GPU and falls back on CPU if needed. Can be
            specified by passing either ``'cuda'``, ``'gpu'``, ``'cpu'`` or an instance of
            ``torch.device``.

        Raises
        ------
        ValueError
            If `k` is an array and `aggregator` is ``None``.
        NotImplementedError
            If the choice of `backend` is not implemented.
        """
        super().__init__()

        backend_str: str = backend.lower()
        BackendValidator(
            backend_options={'pytorch': ['pytorch']},
            construct_name=self.__class__.__name__
        ).verify_backend(backend_str)

        # index with the validated, lowercased backend name
        backend_cls, ensembler_cls = backends[backend_str]
        ensembler = None

        if aggregator is None and isinstance(k, (list, np.ndarray, tuple)):
            raise ValueError('If `k` is a `np.ndarray`, `list` or `tuple`, '
                             'the `aggregator` argument cannot be ``None``.')

        if isinstance(k, (list, np.ndarray, tuple)):
            ensembler = ensembler_cls(
                normalizer=get_normalizer(normalizer),
                aggregator=get_aggregator(aggregator)
            )

        self.backend = backend_cls(k, kernel=kernel, ensembler=ensembler, device=device)

        # set metadata
        self.meta['detector_type'] = 'outlier'
        self.meta['data_type'] = 'numeric'
        self.meta['online'] = False

    def fit(self, x_ref: np.ndarray) -> None:
        """Fit the detector on reference data.

        Parameters
        ----------
        x_ref
            Reference data used to fit the detector.
        """
        self.backend.fit(self.backend._to_tensor(x_ref))

    @catch_error('NotFittedError')
    @catch_error('ThresholdNotInferredError')
    def score(self, x: np.ndarray) -> np.ndarray:
        """Score `x` instances using the detector.

        Computes the local outlier factor for each point in `x`. This is the density of each point `x`
        relative to those of its neighbors in `x_ref`. If `k` is an array of values then the score for
        each `k` is aggregated using the ensembler.

        Parameters
        ----------
        x
            Data to score. The shape of `x` should be `(n_instances, n_features)`.

        Returns
        -------
        Outlier scores. The shape of the scores is `(n_instances,)`. The higher the score, the more \
        anomalous the instance.

        Raises
        ------
        NotFittedError
            If called before the detector has been fit.
        ThresholdNotInferredError
            If `k` is a list and a threshold was not inferred.
        """
        score = self.backend.score(self.backend._to_tensor(x))
        score = self.backend._ensembler(score)
        return self.backend._to_numpy(score)

    @catch_error('NotFittedError')
    def infer_threshold(self, x: np.ndarray, fpr: float) -> None:
        """Infer the threshold for the LOF detector.

        The threshold is computed so that the outlier detector would incorrectly classify `fpr`
        proportion of the reference data as outliers.

        Parameters
        ----------
        x
            Reference data used to infer the threshold.
        fpr
            False positive rate used to infer the threshold. The false positive rate is the proportion
            of instances in `x` that are incorrectly classified as outliers. It should be in the range
            ``(0, 1)``.

        Raises
        ------
        ValueError
            Raised if `fpr` is not in ``(0, 1)``.
        NotFittedError
            If called before the detector has been fit.
        """
        self.backend.infer_threshold(self.backend._to_tensor(x), fpr)

    @catch_error('NotFittedError')
    @catch_error('ThresholdNotInferredError')
    def predict(self, x: np.ndarray) -> Dict[str, Any]:
        """Predict whether the instances in `x` are outliers or not.

        Scores the instances in `x` and, if the threshold was inferred, returns the outlier labels and
        p-values as well.

        Parameters
        ----------
        x
            Data to predict. The shape of `x` should be `(n_instances, n_features)`.

        Returns
        -------
        Dictionary with keys 'data' and 'meta'. 'data' contains the outlier scores. If threshold \
        inference was performed, 'data' also contains the threshold value, outlier labels and p-vals. \
        The shape of the scores is `(n_instances,)`. The higher the score, the more anomalous the \
        instance. 'meta' contains information about the detector.

        Raises
        ------
        NotFittedError
            If called before the detector has been fit.
        ThresholdNotInferredError
            If `k` is a list and a threshold was not inferred.
        """
        outputs = self.backend.predict(self.backend._to_tensor(x))
        output = outlier_prediction_dict()
        output['data'] = {
            **output['data'],
            **self.backend._to_numpy(outputs)
        }
        output['meta'] = {
            **output['meta'],
            'name': self.__class__.__name__,
            'detector_type': 'outlier',
            'online': False,
            'version': __version__,
        }
        return output
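The score → infer_threshold → predict pipeline that the class above wires together can be sketched in plain numpy. This is a simplified illustration, not the library's implementation: the real threshold logic lives in the `TorchOutlierDetector` base class, and the quantile-based rule below is an assumption about its behavior; the function names are hypothetical.

```python
import numpy as np

def infer_threshold(scores_ref: np.ndarray, fpr: float) -> float:
    # Pick a threshold such that roughly `fpr` of the reference scores lie above it.
    if not 0 < fpr < 1:
        raise ValueError('`fpr` must be in (0, 1).')
    return float(np.quantile(scores_ref, 1 - fpr))

def predict(scores: np.ndarray, threshold: float) -> np.ndarray:
    # Instances scoring above the threshold are flagged as outliers.
    return scores > threshold

rng = np.random.default_rng(0)
scores_ref = rng.normal(size=1000)      # stand-in for detector scores on x_ref
threshold = infer_threshold(scores_ref, fpr=0.1)
labels = predict(scores_ref, threshold)
print(labels.mean())                    # close to the requested fpr of 0.1
```

By construction, applying the inferred threshold back to the reference scores flags approximately an `fpr` fraction of them, which is exactly the contract described in the `infer_threshold` docstring.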
@@ -0,0 +1,164 @@
from typing import Optional, Union, List, Tuple
from typing_extensions import Literal
import numpy as np
import torch

from alibi_detect.od.pytorch.ensemble import Ensembler
from alibi_detect.od.pytorch.base import TorchOutlierDetector


class LOFTorch(TorchOutlierDetector):
    def __init__(
        self,
        k: Union[np.ndarray, List, Tuple, int],
        kernel: Optional[torch.nn.Module] = None,
        ensembler: Optional[Ensembler] = None,
        device: Optional[Union[Literal['cuda', 'gpu', 'cpu'], 'torch.device']] = None,
    ):
        """PyTorch backend for the LOF detector.

        Parameters
        ----------
        k
            Number of nearest neighbors used to compute the local outlier factor. `k` can be a single
            value or an array of integers. If `k` is a single value the score method uses the
            distance/kernel similarity to the `k`-th nearest neighbor. If `k` is a list then it uses
            the distance/kernel similarity to each of the specified `k` neighbors.
        kernel
            If a kernel is specified then instead of using `torch.cdist` the kernel defines the `k`
            nearest neighbor distance.
        ensembler
            If `k` is an array of integers then the ensembler must not be ``None``. Should be an
            instance of :py:obj:`alibi_detect.od.pytorch.ensemble.Ensembler`. Responsible for
            combining multiple scores into a single score.
        device
            Device type used. The default tries to use the GPU and falls back on CPU if needed. Can be
            specified by passing either ``'cuda'``, ``'gpu'``, ``'cpu'`` or an instance of
            ``torch.device``.
        """
        TorchOutlierDetector.__init__(self, device=device)
        self.kernel = kernel
        self.ensemble = isinstance(k, (np.ndarray, list, tuple))
        self.ks = torch.tensor(k) if self.ensemble else torch.tensor([k], device=self.device)
        self.ensembler = ensembler

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """Detect if `x` is an outlier.

        Parameters
        ----------
        x
            `torch.Tensor` with leading batch dimension.

        Returns
        -------
        `torch.Tensor` of ``bool`` values with leading batch dimension.

        Raises
        ------
        ThresholdNotInferredError
            If called before the `infer_threshold` method has been called.
        """
        raw_scores = self.score(x)
        scores = self._ensembler(raw_scores)
        if not torch.jit.is_scripting():
            self.check_threshold_inferred()
        preds = scores > self.threshold
        return preds

    def _make_mask(self, reachabilities: torch.Tensor):
        """Generate a mask for computing the average reachability.

        If `k` is an array then we need to compute the average reachability for each `k` separately.
        To do this we use a mask that weights the reachability of each of the `k` closest neighbors
        by `1/k` and the remaining neighbors by `0`.
        """
        mask = torch.zeros_like(reachabilities[0], device=self.device)
        for i, k in enumerate(self.ks):
            mask[:k, i] = torch.ones(k, device=self.device)/k
        return mask

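The mask construction is easier to see in a small standalone sketch (numpy here for illustration; the helper name and shapes mirror `_make_mask` but are an assumption, not the library's code). Each column corresponds to one value of `k`, weights the `k` closest neighbors by `1/k`, and zeroes the rest, so multiplying by the sorted reachabilities and summing computes the per-`k` averages in one shot.

```python
import numpy as np

def make_mask(ks, max_k):
    # One column per k: first k entries weighted 1/k, the rest 0.
    # Summing (reachabilities * mask) over the neighbor axis then yields
    # the average reachability over the k closest neighbors, for each k.
    mask = np.zeros((max_k, len(ks)))
    for i, k in enumerate(ks):
        mask[:k, i] = 1.0 / k
    return mask

mask = make_mask([2, 3, 5], max_k=5)
print(mask.sum(axis=0))  # each column sums to 1.0, i.e. each is a proper average
```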
    def _compute_K(self, x, y):
        """Compute the distance matrix between `x` and `y`."""
        return torch.exp(-self.kernel(x, y)) if self.kernel is not None else torch.cdist(x, y)

    def score(self, x: torch.Tensor) -> torch.Tensor:
        """Computes the score of `x`.

        Parameters
        ----------
        x
            The tensor of instances. The first dimension corresponds to the batch.

        Returns
        -------
        Tensor of scores for each element in `x`.

        Raises
        ------
        NotFittedError
            If called before the detector has been fit.
        """
        self.check_fitted()

        # compute the distance matrix between x and x_ref
        K = self._compute_K(x, self.x_ref)

        # compute k nearest neighbors for the maximum k in self.ks
        max_k = torch.max(self.ks)
        bot_k_items = torch.topk(K, int(max_k), dim=1, largest=False)
        bot_k_inds, bot_k_dists = bot_k_items.indices, bot_k_items.values

        # To compute the reachabilities we get the k-distances of each object in the instance's
        # k nearest neighbors. Then we take the maximum of their k-distances and the distance
        # to the instance.
        lower_bounds = self.knn_dists_ref[bot_k_inds]
        reachabilities = torch.max(bot_k_dists[:, :, None], lower_bounds)

        # Compute the average reachability for each instance. We use a mask to manage each k in
        # self.ks separately.
        mask = self._make_mask(reachabilities)
        avg_reachabilities = (reachabilities*mask[None, :, :]).sum(1)

        # Compute the LOF score for each instance. Note we don't take 1/avg_reachabilities as
        # avg_reachabilities is the denominator in the LOF formula.
        factors = (self.ref_inv_avg_reachabilities[bot_k_inds] * mask[None, :, :]).sum(1)
        lofs = (avg_reachabilities * factors)
        return lofs if self.ensemble else lofs[:, 0]

    def fit(self, x_ref: torch.Tensor):
        """Fits the detector.

        Parameters
        ----------
        x_ref
            The reference dataset tensor.
        """
        # compute the distance matrix
        K = self._compute_K(x_ref, x_ref)
        # set the diagonal to the max distance to prevent torch.topk from returning the instance itself
        K += torch.eye(len(K), device=self.device) * torch.max(K)

        # compute k nearest neighbors for the maximum k in self.ks
        max_k = torch.max(self.ks)
        bot_k_items = torch.topk(K, int(max_k), dim=1, largest=False)
        bot_k_inds, bot_k_dists = bot_k_items.indices, bot_k_items.values

        # store the k-distances for each instance for each k
        self.knn_dists_ref = bot_k_dists[:, self.ks-1]

        # To compute the reachabilities we get the k-distances of each object in the instance's
        # k nearest neighbors. Then we take the maximum of their k-distances and the distance
        # to the instance.
        lower_bounds = self.knn_dists_ref[bot_k_inds]
        reachabilities = torch.max(bot_k_dists[:, :, None], lower_bounds)

        # Compute the average reachability for each instance. We use a mask to manage each k in
        # self.ks separately.
        mask = self._make_mask(reachabilities)
        avg_reachabilities = (reachabilities*mask[None, :, :]).sum(1)

        # Compute the inverse average reachability for each instance.
        self.ref_inv_avg_reachabilities = 1/avg_reachabilities

        self.x_ref = x_ref
        self._set_fitted()
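For checking the reachability arithmetic by hand, the fit/score logic above can be mirrored for a single `k` in self-contained numpy. This is a sketch, not the library's code: the function name is hypothetical, and small numerical differences from the torch version (e.g. tie handling in `topk` vs `argsort`) are possible.

```python
import numpy as np

def lof_fit_score(x_ref: np.ndarray, x: np.ndarray, k: int) -> np.ndarray:
    # --- "fit": k-distances and inverse average reachabilities on x_ref ---
    K = np.linalg.norm(x_ref[:, None, :] - x_ref[None, :, :], axis=-1)
    np.fill_diagonal(K, K.max())                  # exclude self as a neighbor
    nn_inds = np.argsort(K, axis=1)[:, :k]
    nn_dists = np.take_along_axis(K, nn_inds, axis=1)
    kdist_ref = nn_dists[:, k - 1]                # distance to k-th neighbor
    # reachability = max(neighbor's k-distance, distance to the neighbor)
    reach_ref = np.maximum(nn_dists, kdist_ref[nn_inds])
    inv_avg_reach_ref = 1.0 / reach_ref.mean(axis=1)

    # --- "score": avg reachability of x times mean neighbor density ---
    D = np.linalg.norm(x[:, None, :] - x_ref[None, :, :], axis=-1)
    inds = np.argsort(D, axis=1)[:, :k]
    dists = np.take_along_axis(D, inds, axis=1)
    reach = np.maximum(dists, kdist_ref[inds])
    return reach.mean(axis=1) * inv_avg_reach_ref[inds].mean(axis=1)

rng = np.random.default_rng(0)
x_ref = rng.normal(size=(100, 2))
x = np.array([[0.0, 0.0], [10.0, 10.0]])          # an inlier and a far outlier
scores = lof_fit_score(x_ref, x, k=5)
print(scores)  # the outlier's LOF score is much larger than the inlier's
```

As in the torch backend, the score is the product of the instance's average reachability (the denominator of the local reachability density) and the mean inverse average reachability of its neighbors, so points in locally sparse regions score well above 1.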
Review comment: I think we should be a little clearer as to what the normalizer and aggregator refer to, as it's not clear here or from the kwarg descriptions. I realise this applies to KNN too.

Reply: Agreed, I've opened an issue here. I'll include it in a final clean up PR I think.