
Feature LOF outlier detector #746

Merged
merged 285 commits into from
Jun 12, 2023

Conversation

mauicv
Collaborator

@mauicv mauicv commented Feb 21, 2023

What is this

Implementation of the LOF (local outlier factor) outlier detector. See this notebook for an example. Branches off gmm-od.
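For context only, the core idea can be sketched with scikit-learn's `LocalOutlierFactor` in novelty mode (this is not the API added by this PR):

```python
# Illustrative sketch of LOF-based outlier scoring using scikit-learn,
# not the detector implemented in this PR.
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
x_ref = rng.normal(size=(100, 2))           # reference (inlier) data
x = np.array([[0.0, 0.0], [8.0, 8.0]])      # one inlier, one clear outlier

# novelty=True fits on x_ref and scores new, unseen instances
lof = LocalOutlierFactor(n_neighbors=10, novelty=True).fit(x_ref)
# score_samples returns the opposite of the LOF; negate so higher = more anomalous
scores = -lof.score_samples(x)
```

Inliers score close to 1, while the far-away point scores much higher.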

TODO:

  • pytorch backend
  • docs
  • tests
  • example notebook
  • mypy
  • flake8
  • Optional deps
  • Merge kNN fit/infer_threshold changes

@mauicv mauicv requested review from ojcobb and ascillitoe May 24, 2023 15:27
values then the score method uses the distance/kernel similarity to each of the specified `k` neighbors.
In the latter case, an `aggregator` must be specified to aggregate the scores.

Note that, in the multiple k case, a normalizer can be provided. If a normalizer is passed then it is fit in
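To illustrate what a normalizer and aggregator might do in the multiple-`k` case (the names and behaviour below are hypothetical, not this PR's API): each instance gets one score per value of `k`, the normalizer rescales each per-`k` column using statistics fit on reference scores, and the aggregator collapses the columns into a single score.

```python
# Hypothetical sketch of multi-k normalisation + aggregation;
# not this PR's actual normalizer/aggregator classes.
import numpy as np

def aggregate_scores(scores: np.ndarray, ref_scores: np.ndarray) -> np.ndarray:
    # normaliser: shift/scale each per-k column by reference statistics
    normed = (scores - ref_scores.mean(axis=0)) / ref_scores.std(axis=0)
    # aggregator: combine the per-k scores into one score per instance
    return normed.mean(axis=1)

ref = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])  # scores at, say, k = (5, 10)
new = np.array([[9.0, 10.0]])                          # one new instance
agg = aggregate_scores(new, ref)
```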
Contributor

I think we should be a little clearer about what the normalizer and aggregator refer to, as it's not clear here or from the kwarg descriptions. I realise this applies to kNN too.

Collaborator Author

Agreed, I've opened an issue here. I'll include it in a final clean-up PR, I think.

def score(self, x: np.ndarray) -> np.ndarray:
"""Score `x` instances using the detector.

Computes the local outlier factor for each instance in `x`. If `k` is an array of values then the score for
Contributor

Maybe worth just noting here that the outlier factor is the density of each instance in x relative to those of its neighbours in x_ref.
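The relative-density idea described above can be sketched in plain NumPy (illustrative only, not this PR's implementation): each query's local reachability density is compared against the densities of its `k` nearest neighbours in the reference set, so scores near 1 mean "as dense as the neighbourhood" and large scores mean outlying.

```python
# Minimal NumPy sketch of LOF scoring against a reference set;
# illustrative, not the PR's backend code.
import numpy as np

def lof_scores(x_ref: np.ndarray, x: np.ndarray, k: int) -> np.ndarray:
    # pairwise distances between reference points
    d_ref = np.linalg.norm(x_ref[:, None] - x_ref[None], axis=-1)
    np.fill_diagonal(d_ref, np.inf)               # a point is not its own neighbour
    order = np.argsort(d_ref, axis=1)[:, :k]      # indices of k nearest ref neighbours
    knn_ref = np.take_along_axis(d_ref, order, axis=1)
    k_dist = knn_ref[:, -1]                       # k-distance of each reference point

    # local reachability density of each reference point
    reach = np.maximum(knn_ref, k_dist[order])
    lrd_ref = 1.0 / reach.mean(axis=1)

    # score queries: their density relative to their ref neighbours' densities
    d = np.linalg.norm(x[:, None] - x_ref[None], axis=-1)
    idx = np.argsort(d, axis=1)[:, :k]
    d_knn = np.take_along_axis(d, idx, axis=1)
    reach_x = np.maximum(d_knn, k_dist[idx])
    lrd_x = 1.0 / reach_x.mean(axis=1)
    return lrd_ref[idx].mean(axis=1) / lrd_x      # > 1 means less dense than neighbours

rng = np.random.default_rng(0)
x_ref = rng.normal(size=(200, 2))
scores = lof_scores(x_ref, np.array([[0.0, 0.0], [8.0, 8.0]]), k=10)
```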

return mask

def _compute_K(self, x, y):
"""Compute the distance/similarity matrix matrix between `x` and `y`."""
Contributor

@ojcobb ojcobb May 26, 2023


Could we remove "/similarity" here? A similarity matrix would have entries that increase with similarity, whereas this is the opposite.
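For reference, a pairwise Euclidean distance matrix of the kind such a helper returns can be computed via the squared-norm expansion (illustrative sketch, not the PR's `_compute_K`):

```python
# One way to compute a pairwise Euclidean distance matrix;
# illustrative sketch, not this PR's helper.
import numpy as np

def dist_matrix(x: np.ndarray, y: np.ndarray) -> np.ndarray:
    # ||xi - yj||^2 = ||xi||^2 + ||yj||^2 - 2 * xi . yj
    sq = (x**2).sum(1)[:, None] + (y**2).sum(1)[None, :] - 2.0 * x @ y.T
    return np.sqrt(np.clip(sq, 0.0, None))  # clip guards tiny negative rounding errors

x = np.array([[0.0, 0.0], [3.0, 4.0]])
y = np.array([[0.0, 0.0], [0.0, 4.0]])
d = dist_matrix(x, y)
```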

@ojcobb
Contributor

ojcobb commented May 26, 2023

Ran through all the code and couldn't find anything worth commenting on! Just the nitpicks on docstrings above. Nice!

@ascillitoe
Contributor

couldn't find anything worth commenting on

Challenge accepted 😛


# set metadata
self.meta['detector_type'] = 'outlier'
self.meta['data_type'] = 'numeric'
Contributor

@ascillitoe ascillitoe May 30, 2023

Not isolated to this PR, but noting that we seem to be a little inconsistent across the new and old outlier detectors with respect to when data_type is hard-coded and when it is optionally set via a kwarg. For some it is hard-coded to time-series (which makes sense), for some (e.g. the old Mahalanobis) it is set via a kwarg, and for some it is hard-coded to numeric. Maybe worth opening an issue to review this more generally?

Already mentioned in #567 (comment), but highlighting here since we are setting data_type in new detectors too...


Returns
-------
Outlier scores. The shape of the scores is `(n_instances,)`. The higher the score, the more anomalous the \
Contributor

Nitpick: In a few places outlying is used e.g.

the l2-norm of the projected data. The higher the score, the more outlying the instance.

whereas in the score docstring (and for _pca, _gmm, _knn, mahalanobis) anomalous is used. Worth picking one or the other?

assert torch.all(lof_torch(x) == torch.tensor([0, 0, 1]))


@pytest.mark.skip(reason="Can't convert GaussianRBF to torch script due to torch script type constraints")
Contributor

What is the intention behind including this test if it is skipped in all cases?

Collaborator Author

@mauicv mauicv Jun 12, 2023

I wrote it, then decided not to include this functionality in the first set of outlier detectors. However, it should be implemented at some point, which is why I've left it in but skipped. I can take it out if you prefer?

see #810

Contributor

@ascillitoe ascillitoe left a comment

Few minor comments, otherwise LGTM!

@mauicv mauicv merged commit 5e69f4b into SeldonIO:master Jun 12, 2023
Labels
Type: New method New method proposals