v1.0.9 #509

Merged
merged 13 commits into from
Jun 25, 2023
1 change: 1 addition & 0 deletions CHANGES.txt
@@ -177,3 +177,4 @@ v<1.0.8>, <03/08/2023> -- Improve clone compatibility (#471).
v<1.0.8>, <03/08/2023> -- Add QMCD detector (#452).
v<1.0.8>, <03/08/2023> -- Optimized ECDF and drop Statsmodels dependency (#467).
v<1.0.9>, <03/19/2023> -- Hot fix for errors in ECOD and COPOD due to the issue of scipy.
v<1.0.9>, <06/19/2023> -- Further integration of PyThresh.
60 changes: 57 additions & 3 deletions README.rst
@@ -156,6 +156,7 @@ NeurIPS 2022 paper `ADBench: Anomaly Detection Benchmark Paper <https://www.andr
* `ADBench Benchmark <#adbench-benchmark>`_
* `Model Save & Load <#model-save--load>`_
* `Fast Train with SUOD <#fast-train-with-suod>`_
* `Thresholding Outlier Scores <#thresholding-outlier-scores>`_
* `Implemented Algorithms <#implemented-algorithms>`_
* `Quick Start for Outlier Detection <#quick-start-for-outlier-detection>`_
* `How to Contribute <#how-to-contribute>`_
@@ -327,7 +328,25 @@ and `SUOD example <https://github.com/yzhao062/pyod/blob/master/examples/suod_e
clf = SUOD(base_estimators=detector_list, n_jobs=2, combination='average',
verbose=False)
----

Thresholding Outlier Scores
^^^^^^^^^^^^^^^^^^^^^^^^^^^

The contamination level can be set with a more data-driven approach.
A thresholding method replaces the guess of an arbitrary value
with tested techniques for separating inliers and outliers. Refer to
`PyThresh <https://github.com/KulikDM/pythresh>`_ for
a more in-depth look at thresholding.


.. code-block:: python

    from pyod.models.knn import KNN
    from pyod.models.thresholds import FILTER

    # Set the outlier detection and thresholding methods
    clf = KNN(contamination=FILTER())

----
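
To make the contrast concrete, the sketch below compares a guessed contamination value with a cut-off derived from the scores themselves. It uses only the standard library. ``fixed_contamination_labels`` and ``iqr_labels`` are hypothetical helper names, the scores are made up, and the Q3 + 1.5 * IQR rule is a generic illustration rather than the implementation of any PyThresh method.

```python
import statistics


def fixed_contamination_labels(scores, contamination):
    """Flag roughly the top `contamination` fraction of scores as outliers."""
    n_outliers = max(1, round(len(scores) * contamination))
    # The cut-off is the smallest score among the n_outliers largest ones.
    cutoff = sorted(scores)[-n_outliers]
    return [1 if s >= cutoff else 0 for s in scores]


def iqr_labels(scores):
    """Flag scores above Q3 + 1.5 * IQR (a generic rule, for illustration)."""
    q1, _, q3 = statistics.quantiles(scores, n=4, method="inclusive")
    cutoff = q3 + 1.5 * (q3 - q1)
    return [1 if s > cutoff else 0 for s in scores]


# Hypothetical outlier scores: nine ordinary points and one clear outlier.
scores = [0.9, 1.0, 1.1, 1.2, 1.05, 0.8, 1.15, 0.95, 1.02, 6.5]

guessed = fixed_contamination_labels(scores, contamination=0.3)
derived = iqr_labels(scores)

print("flagged with a guessed contamination of 0.3:", sum(guessed))
print("flagged with the IQR rule:", sum(derived))
```

With these scores, the guessed contamination of 0.3 flags three points, while the data-derived cut-off flags only the single extreme score.
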
@@ -337,7 +356,7 @@ and `SUOD example <https://github.com/yzhao062/pyod/blob/master/examples/suod_e
Implemented Algorithms
^^^^^^^^^^^^^^^^^^^^^^

-PyOD toolkit consists of three major functional groups:
+PyOD toolkit consists of four major functional groups:

**(i) Individual Detection Algorithms** :

@@ -411,8 +430,43 @@ Combination Median Simple combination by taking the median o
Combination majority Vote Simple combination by taking the majority vote of the labels (weights can be used) 2015 [#Aggarwal2015Theoretical]_
=================== ================ ===================================================================================================== ===== ========================================


-**(iii) Utility Functions**:
+**(iii) Outlier Detection Score Thresholding Methods**:

========================  ======  ========================================  ========================================================================================================
Type                      Abbr    Algorithm                                 Documentation
========================  ======  ========================================  ========================================================================================================
Kernel-Based              AUCP    Area Under Curve Percentage               `AUCP <https://pyod.readthedocs.io/en/latest/pyod.models.html#module-pyod.models.thresholds.AUCP>`_
Statistical Moment-Based  BOOT    Bootstrapping                             `BOOT <https://pyod.readthedocs.io/en/latest/pyod.models.html#module-pyod.models.thresholds.BOOT>`_
Normality-Based           CHAU    Chauvenet's Criterion                     `CHAU <https://pyod.readthedocs.io/en/latest/pyod.models.html#module-pyod.models.thresholds.CHAU>`_
Linear Model              CLF     Trained Linear Classifier                 `CLF <https://pyod.readthedocs.io/en/latest/pyod.models.html#module-pyod.models.thresholds.CLF>`_
Cluster-Based             CLUST   Clustering Based                          `CLUST <https://pyod.readthedocs.io/en/latest/pyod.models.html#module-pyod.models.thresholds.CLUST>`_
Kernel-Based              CPD     Change Point Detection                    `CPD <https://pyod.readthedocs.io/en/latest/pyod.models.html#module-pyod.models.thresholds.CPD>`_
Transformation-Based      DECOMP  Decomposition                             `DECOMP <https://pyod.readthedocs.io/en/latest/pyod.models.html#module-pyod.models.thresholds.DECOMP>`_
Normality-Based           DSN     Distance Shift from Normal                `DSN <https://pyod.readthedocs.io/en/latest/pyod.models.html#module-pyod.models.thresholds.DSN>`_
Curve-Based               EB      Elliptical Boundary                       `EB <https://pyod.readthedocs.io/en/latest/pyod.models.html#module-pyod.models.thresholds.EB>`_
Kernel-Based              FGD     Fixed Gradient Descent                    `FGD <https://pyod.readthedocs.io/en/latest/pyod.models.html#module-pyod.models.thresholds.FGD>`_
Filter-Based              FILTER  Filtering Based                           `FILTER <https://pyod.readthedocs.io/en/latest/pyod.models.html#module-pyod.models.thresholds.FILTER>`_
Curve-Based               FWFM    Full Width at Full Minimum                `FWFM <https://pyod.readthedocs.io/en/latest/pyod.models.html#module-pyod.models.thresholds.FWFM>`_
Statistical Test-Based    GESD    Generalized Extreme Studentized Deviate   `GESD <https://pyod.readthedocs.io/en/latest/pyod.models.html#module-pyod.models.thresholds.GESD>`_
Filter-Based              HIST    Histogram Based                           `HIST <https://pyod.readthedocs.io/en/latest/pyod.models.html#module-pyod.models.thresholds.HIST>`_
Quantile-Based            IQR     Inter-Quartile Region                     `IQR <https://pyod.readthedocs.io/en/latest/pyod.models.html#module-pyod.models.thresholds.IQR>`_
Statistical Moment-Based  KARCH   Karcher mean (Riemannian Center of Mass)  `KARCH <https://pyod.readthedocs.io/en/latest/pyod.models.html#module-pyod.models.thresholds.KARCH>`_
Statistical Moment-Based  MAD     Median Absolute Deviation                 `MAD <https://pyod.readthedocs.io/en/latest/pyod.models.html#module-pyod.models.thresholds.MAD>`_
Statistical Test-Based    MCST    Monte Carlo Shapiro Tests                 `MCST <https://pyod.readthedocs.io/en/latest/pyod.models.html#module-pyod.models.thresholds.MCST>`_
Ensembles-Based           META    Meta-model Trained Classifier             `META <https://pyod.readthedocs.io/en/latest/pyod.models.html#module-pyod.models.thresholds.META>`_
Transformation-Based      MOLL    Friedrichs' Mollifier                     `MOLL <https://pyod.readthedocs.io/en/latest/pyod.models.html#module-pyod.models.thresholds.MOLL>`_
Statistical Test-Based    MTT     Modified Thompson Tau Test                `MTT <https://pyod.readthedocs.io/en/latest/pyod.models.html#module-pyod.models.thresholds.MTT>`_
Linear Model              OCSVM   One-Class Support Vector Machine          `OCSVM <https://pyod.readthedocs.io/en/latest/pyod.models.html#module-pyod.models.thresholds.OCSVM>`_
Quantile-Based            QMCD    Quasi-Monte Carlo Discrepancy             `QMCD <https://pyod.readthedocs.io/en/latest/pyod.models.html#module-pyod.models.thresholds.QMCD>`_
Linear Model              REGR    Regression Based                          `REGR <https://pyod.readthedocs.io/en/latest/pyod.models.html#module-pyod.models.thresholds.REGR>`_
Neural Networks           VAE     Variational Autoencoder                   `VAE <https://pyod.readthedocs.io/en/latest/pyod.models.html#module-pyod.models.thresholds.VAE>`_
Curve-Based               WIND    Topological Winding Number                `WIND <https://pyod.readthedocs.io/en/latest/pyod.models.html#module-pyod.models.thresholds.WIND>`_
Transformation-Based      YJ      Yeo-Johnson Transformation                `YJ <https://pyod.readthedocs.io/en/latest/pyod.models.html#module-pyod.models.thresholds.YJ>`_
Normality-Based           ZSCORE  Z-score                                   `ZSCORE <https://pyod.readthedocs.io/en/latest/pyod.models.html#module-pyod.models.thresholds.ZSCORE>`_
========================  ======  ========================================  ========================================================================================================
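
As an illustration of what a single entry in the table does, the sketch below implements a generic median-absolute-deviation rule with the standard library. ``mad_labels`` and its ``n_deviations`` parameter are hypothetical names, the scores are made up, and the rule (median plus three scaled MADs) is a textbook variant that may differ from the MAD thresholder shipped with PyThresh.

```python
import statistics


def mad_labels(scores, n_deviations=3.0):
    """Flag scores more than `n_deviations` scaled MADs above the median."""
    median = statistics.median(scores)
    mad = statistics.median([abs(s - median) for s in scores])
    # 1.4826 scales the MAD so it is comparable to a standard deviation
    # when the scores are roughly normally distributed.
    cutoff = median + n_deviations * 1.4826 * mad
    # Note: if the MAD is zero, every score above the median is flagged.
    return [1 if s > cutoff else 0 for s in scores]


# Hypothetical outlier scores with two clearly elevated values.
scores = [0.2, 0.3, 0.25, 0.35, 0.3, 0.28, 0.31, 2.4, 0.27, 3.1]
labels = mad_labels(scores)
print(labels)
```

Because the median and the MAD are barely affected by the two large values, the cut-off stays close to the bulk of the scores and both elevated points are flagged.
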


**(iv) Utility Functions**:

=================== ====================== ===================================================================================================================================================== ======================================================================================================================================
Type Name Function Documentation
4 changes: 2 additions & 2 deletions docs/about.rst
@@ -5,10 +5,10 @@ About us
Core Development Team
---------------------

-Yue Zhao (Ph.D. Student @ Carnegie Mellon University):
+Yue Zhao (Assistant Professor @ USC, Ph.D. @ CMU):

 - Initialized the project in 2017
-- `Homepage <https://www.andrew.cmu.edu/user/yuezhao2/>`_
+- `Homepage <https://viterbi-web.usc.edu/~yzhao010/>`_
 - `LinkedIn (Yue Zhao) <https://www.linkedin.com/in/yzhao062/>`_

Zain Nasrullah (Data Scientist at RBC; MSc in Computer Science from University of Toronto):
39 changes: 39 additions & 0 deletions docs/example.rst
@@ -191,6 +191,45 @@ please navigate to **"/notebooks/Model Combination.ipynb"**
Combination by AOM ROC:0.9257, precision @ rank n:0.4844
Combination by MOA ROC:0.9263, precision @ rank n:0.4688
Thresholding Example
--------------------


Full example: `threshold_example.py <https://github.com/yzhao062/Pyod/blob/master/examples/threshold_example.py>`_

1. Import models

.. code-block:: python

    from pyod.models.knn import KNN  # kNN detector
    from pyod.models.thresholds import FILTER  # Filter thresholder
    from pyod.utils.data import generate_data  # sample data generator used in step 2

2. Generate sample data with :func:`pyod.utils.data.generate_data`:

.. code-block:: python

    contamination = 0.1  # percentage of outliers
    n_train = 200  # number of training points
    n_test = 100  # number of testing points

    X_train, X_test, y_train, y_test = generate_data(
        n_train=n_train, n_test=n_test, contamination=contamination)

3. Initialize a :class:`pyod.models.knn.KNN` detector, fit the model, and make
the prediction.

.. code-block:: python

    # train kNN detector and apply FILTER thresholding
    clf_name = 'KNN'
    clf = KNN(contamination=FILTER())
    clf.fit(X_train)

    # get the prediction labels and outlier scores of the training data
    y_train_pred = clf.labels_  # binary labels (0: inliers, 1: outliers)
    y_train_scores = clf.decision_scores_  # raw outlier scores

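
Once the labels are available, a natural follow-up is to check how many points the thresholder flagged and how the labels compare with the ground truth. The sketch below does this with plain Python lists standing in for ``y_train`` and ``y_train_pred``. ``summarize_labels`` is a hypothetical helper, not part of PyOD, and the label values are made up.

```python
def summarize_labels(y_true, y_pred):
    """Return the flagged fraction, precision, and recall of binary labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    flagged = sum(y_pred)
    actual = sum(y_true)
    return {
        # Share of points labelled as outliers, i.e. the effective contamination.
        "flagged_fraction": flagged / len(y_pred),
        "precision": true_pos / flagged if flagged else 0.0,
        "recall": true_pos / actual if actual else 0.0,
    }


# Stand-ins for y_train and y_train_pred (0: inlier, 1: outlier).
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 0, 0, 0, 0, 1, 1]

summary = summarize_labels(y_true, y_pred)
print(summary)
```

The flagged fraction is a quick way to see which contamination level the thresholder effectively settled on, which can then be compared with the value used to generate the data.
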
.. rubric:: References

.. bibliography::
9 changes: 9 additions & 0 deletions docs/pyod.models.rst
@@ -367,6 +367,15 @@ pyod.models.suod module
:show-inheritance:
:inherited-members:

pyod.models.thresholds module
-----------------------------

.. automodule:: pyod.models.thresholds
    :members:
    :undoc-members:
    :show-inheritance:
    :inherited-members:

pyod.models.vae module
----------------------

4 changes: 3 additions & 1 deletion docs/requirements.txt
@@ -7,8 +7,10 @@ matplotlib
nose
numpy>=1.19
numba==0.53 # need to lift this later see github for issue
pyclustering
pytest
-pythresh
+pythresh>=0.3.1
ruptures
scipy>=1.5.1
scikit_learn>=0.20.0
scikit-lego
9 changes: 4 additions & 5 deletions examples/cd_example.py
@@ -15,7 +15,6 @@
sys.path.append(
os.path.abspath(os.path.join(os.path.dirname("__file__"), '..')))

-import numpy as np
from pyod.models.cd import CD
from pyod.utils.data import generate_data
from pyod.utils.data import evaluate_print
@@ -30,22 +29,22 @@
X_train, X_test, y_train, y_test = \
generate_data(n_train=n_train,
n_test=n_test,
-n_features=2,
+n_features=5,
contamination=contamination,
random_state=42)

# train CD detector
clf_name = 'CD'
clf = CD()
-clf.fit(X_train, y_train)
+clf.fit(X_train)

# get the prediction labels and outlier scores of the training data
y_train_pred = clf.labels_ # binary labels (0: inliers, 1: outliers)
y_train_scores = clf.decision_scores_ # raw outlier scores

# get the prediction on the test data
-y_test_pred = clf.predict(np.append(X_test, y_test.reshape(-1,1), axis=1)) # outlier labels (0 or 1)
-y_test_scores = clf.decision_function(np.append(X_test, y_test.reshape(-1,1), axis=1)) # outlier scores
+y_test_pred = clf.predict(X_test) # outlier labels (0 or 1)
+y_test_scores = clf.decision_function(X_test) # outlier scores

# evaluate and print the results
print("\nOn Training Data:")
4 changes: 2 additions & 2 deletions examples/threshold_example.py
@@ -16,7 +16,7 @@
os.path.abspath(os.path.join(os.path.dirname("__file__"), '..')))

from pyod.models.kde import KDE
-from pyod.models.thresholds import ALL
+from pyod.models.thresholds import FILTER
from pyod.utils.data import generate_data
from pyod.utils.data import evaluate_print
from pyod.utils.example import visualize
@@ -36,7 +36,7 @@

# train KDE detector
clf_name = 'KDE'
-clf = KDE(contamination=ALL())
+clf = KDE(contamination=FILTER())
clf.fit(X_train)

# get the prediction labels and outlier scores of the training data
1 change: 0 additions & 1 deletion pyod/models/base.py
@@ -173,7 +173,6 @@ def predict(self, X, return_confidence=False):
# if this is a PyThresh object
else:
prediction = self.contamination.eval(pred_score)
-print(self.contamination)

if return_confidence:
confidence = self.predict_confidence(X)
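
The branch in this hunk can be read as a dispatch on the type of `contamination`. The sketch below mimics that idea with the standard library only, assuming the branch before the `else` handles a numeric contamination. `ToyThresholder` and `label_scores` are hypothetical stand-ins. The only detail taken from the diff is that a thresholder object is called through `eval(scores)` to produce the prediction. The mean-plus-two-standard-deviations rule is an arbitrary placeholder.

```python
import statistics


class ToyThresholder:
    """Hypothetical stand-in for a thresholder object with an eval() method."""

    def eval(self, scores):
        # Placeholder rule: flag scores above mean + 2 standard deviations.
        cutoff = statistics.mean(scores) + 2 * statistics.pstdev(scores)
        return [1 if s > cutoff else 0 for s in scores]


def label_scores(scores, contamination, threshold=None):
    """Label scores with a fixed threshold or by delegating to a thresholder."""
    if isinstance(contamination, (float, int)):
        # Numeric contamination: compare against a precomputed threshold.
        if threshold is None:
            raise ValueError("a numeric contamination needs a threshold")
        return [1 if s > threshold else 0 for s in scores]
    # Otherwise treat contamination as a thresholder object.
    return contamination.eval(scores)


# Hypothetical outlier scores with one extreme value.
scores = [0.1, 0.2, 0.15, 0.12, 0.18, 0.11, 0.16, 0.14, 0.13, 5.0]

fixed = label_scores(scores, contamination=0.1, threshold=1.0)
dynamic = label_scores(scores, contamination=ToyThresholder())
print(fixed)
print(dynamic)
```

Keeping both paths behind one function means callers do not need to know which kind of `contamination` was supplied. Removing the stray `print` keeps the prediction path free of console output.
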