GammaGMM

GammaGMM is a GitHub repository containing the gammaGMM [1] algorithm. It accompanies the paper Estimating the Contamination Factor's Distribution in Unsupervised Anomaly Detection.

The paper is available here: [pdf].

Abstract

Anomaly detection methods identify examples that do not follow the expected behaviour, typically in an unsupervised fashion, by assigning real-valued anomaly scores to the examples based on various heuristics. These scores need to be transformed into actual predictions by thresholding so that the proportion of examples marked as anomalies equals the expected proportion of anomalies, called contamination factor. Unfortunately, there are no good methods for estimating the contamination factor itself. We address this need from a Bayesian perspective, introducing a method for estimating the posterior distribution of the contamination factor for a given unlabeled dataset. We leverage several anomaly detectors to capture the basic notion of anomalousness and estimate the contamination using a specific mixture formulation. Empirically on 22 datasets, we show that the estimated distribution is well-calibrated and that setting the threshold using the posterior mean improves the detectors’ performance over several alternative methods.

Contents and usage

The repository contains:

gammaGMM.py, a function that draws samples from the contamination factor's posterior distribution;
Notebook.ipynb, a notebook showing how to use gammaGMM on an artificial 2D dataset;
Results, a folder that contains the samples we obtained by running our code, along with the true contamination factors;
Online_Supplement.pdf, a PDF with the online supplementary material.

To use gammaGMM, clone the GitHub repository or simply download the files. You can find the benchmark datasets at this [link]. Alternatively, you can use our results directly (i.e., the samples from the posterior), which are in the Results folder.
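Once posterior samples of the contamination factor are available (from the results folder or from your own run), the thresholding rule mentioned in the abstract can be sketched as follows. This is a minimal illustration, not code from the repository: the function name predict_with_posterior_mean is hypothetical, and both the anomaly scores and the "posterior samples" are synthetic placeholders, since the file format of the stored results is not described here.

```python
import numpy as np


def predict_with_posterior_mean(scores, gamma_samples):
    """Flag roughly a fraction gamma_hat of examples as anomalies.

    gamma_hat is the mean of the contamination-factor samples;
    higher scores are assumed to mean "more anomalous".
    """
    scores = np.asarray(scores, dtype=float)
    gamma_hat = float(np.mean(gamma_samples))
    # Threshold at the (1 - gamma_hat) quantile of the scores, so that
    # about gamma_hat of the examples lie strictly above it.
    threshold = float(np.quantile(scores, 1.0 - gamma_hat))
    labels = (scores > threshold).astype(int)  # 1 = anomaly, 0 = normal
    return gamma_hat, threshold, labels


rng = np.random.default_rng(0)
# Placeholder inputs (assumptions for illustration only):
scores = rng.normal(size=1000)              # scores from some detector
gamma_samples = rng.beta(5, 95, size=500)   # stand-in for posterior samples

gamma_hat, threshold, labels = predict_with_posterior_mean(scores, gamma_samples)
# A 95% credible interval summarises the uncertainty around gamma_hat.
ci_low, ci_high = np.quantile(gamma_samples, [0.025, 0.975])
print(f"posterior mean: {gamma_hat:.3f}, 95% interval: [{ci_low:.3f}, {ci_high:.3f}]")
print(f"fraction flagged as anomalies: {labels.mean():.3f}")
```

Replacing the two placeholder arrays with real detector scores and real posterior samples should be the only change needed to apply the same rule to a benchmark dataset.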

GammaGMM: Estimating the Contamination Factor's Distribution in Unsupervised Anomaly Detection

Given a dataset with attributes X, an unsupervised anomaly detector assigns to each example an anomaly score, representing its degree of anomalousness. Thus, the first step of gammaGMM is to use a set of M unsupervised detectors (passed as input by the user) to transform the data into an M-dimensional score space. Then, it fits a DPGMM (Dirichlet Process Gaussian Mixture Model) on this score space. The components of the DPGMM are ordered using our proposed ordering criterion. By measuring how anomalous the components are (jointly), we derive the contamination factor's posterior.

Given a training dataset X and the user-specified hyperparameters p_0 and p_high, gammaGMM can be called as shown in Notebook.ipynb.
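To make the first two steps described above more concrete (building the score space and fitting a DPGMM on it), here is a rough sketch using scikit-learn. It does not call the repository's gammaGMM function and does not reproduce the paper's ordering criterion or posterior derivation. The choice of detectors, the standardization of the scores, the truncation at 10 components, and the final ranking by mean score are all assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.mixture import BayesianGaussianMixture
from sklearn.neighbors import LocalOutlierFactor
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Toy 2D data (illustrative only): 285 normal points plus 15 scattered anomalies.
X = np.vstack([
    rng.normal(0.0, 1.0, size=(285, 2)),
    rng.uniform(-6.0, 6.0, size=(15, 2)),
])

# Step 1: map X into an M-dimensional score space (here M = 2 detectors).
# Both scikit-learn quantities below are "higher = more normal", so they
# are negated to obtain "higher = more anomalous" scores.
iforest = IsolationForest(random_state=0).fit(X)
lof = LocalOutlierFactor(n_neighbors=20).fit(X)
scores = np.column_stack([
    -iforest.score_samples(X),
    -lof.negative_outlier_factor_,
])
# Put the detectors on a comparable scale (an assumption of this sketch).
scores = StandardScaler().fit_transform(scores)

# Step 2: fit a truncated Dirichlet-process Gaussian mixture on the scores.
dpgmm = BayesianGaussianMixture(
    n_components=10,  # truncation level, chosen arbitrarily here
    weight_concentration_prior_type="dirichlet_process",
    max_iter=500,
    random_state=0,
).fit(scores)

# Crude stand-in for an ordering step: rank components from most to least
# anomalous by their mean score across detectors. This is NOT the ordering
# criterion proposed in the paper.
order = np.argsort(dpgmm.means_.mean(axis=1))[::-1]
for k in order:
    print(f"component {k}: weight={dpgmm.weights_[k]:.3f}, mean={dpgmm.means_[k].round(2)}")
```

In this toy setting, components with a high mean score and a small weight are the natural candidates for "anomalous" components; how gammaGMM turns that intuition into a posterior over the contamination factor is described in the paper and implemented in gammaGMM.py.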

Dependencies

The gammaGMM function requires the following Python packages:

Contact

Contact the author of the paper: [email protected].

References

[1] Perini, L., Bürkner, P.-C., Klami, A.: Estimating the Contamination Factor's Distribution in Unsupervised Anomaly Detection. In: The Fortieth International Conference on Machine Learning (ICML), 2023.
