Error in documentation for SpotTheDiff detector on wine quality dataset? #780

Open
vinyasHarish95 opened this issue Apr 27, 2023 · 1 comment
Labels: Type: Docs (anything related to documentation), Type: Question (user questions)

vinyasHarish95 commented Apr 27, 2023

Hi Seldon team, thanks for your great work on this package! I'm using it in my PhD research to understand the impact of different dataset shifts during COVID-19 on a precision public health model.

I was taking a look at the SpotTheDiff detector and the background docs say that "[like pre-processing steps] learned detectors are trained on training data which is held-out from the reference data set".

In the example on the same page, the PCA is trained on X_train and the MMDDrift detector is instantiated on X_ref.
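
As a rough illustration of that pattern, the sketch below splits the data once, fits a preprocessing step on `x_train` only, and keeps `x_ref` for the detector. It is pure Python with a simple standardiser standing in for PCA; `fit_standardiser` and the synthetic data are assumptions made for illustration and are not taken from the docs example.

```python
import random
from statistics import mean, stdev


def fit_standardiser(x_train):
    """Fit a mean/std scaler on x_train only and return it as a function.

    Hypothetical stand-in for a fitted preprocessing step such as PCA.
    """
    mu, sigma = mean(x_train), stdev(x_train)
    return lambda xs: [(x - mu) / sigma for x in xs]


# Synthetic 1-D data, used only to make the sketch runnable.
rng = random.Random(0)
data = [rng.gauss(10.0, 2.0) for _ in range(1000)]

# Split once, up front, so the two portions never overlap.
x_train, x_ref = data[:500], data[500:]

preprocess_fn = fit_standardiser(x_train)  # fitted on x_train only
x_ref_pre = preprocess_fn(x_ref)           # reference data the detector would see
```

The point of the split is that the preprocessing step never sees the samples later used as the reference set.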

However, in the wine quality example, the detector is instantiated directly on X_ref.
So I'm confused about whether part of the whites dataset (an X_train) should have been set aside to train the detector.

Thank you for clarifying.

ojcobb (Contributor) commented Apr 28, 2023

Hi @vinyasHarish95,

Thanks for pointing out this potential source of confusion.

The sentence "it is important that the learned detectors are trained on training data which is held-out from the reference data set" is intended to give intuition for how the learned detectors work, rather than to instruct users to split data before passing it to these detectors. For the learned detectors, the splitting is inherent to the drift detection procedure and is therefore performed automatically inside the detectors.

By contrast, manual data splitting is only relevant to the non-learned detectors, and only in the special case where a preprocessing function is specified and that function has been fit/trained using the same source of reference data. In that special case, the practitioner should handle the data splitting themselves.
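
To make the "splitting happens inside the detector" idea concrete, here is a toy sketch in pure Python. It is not the library's implementation: `split_indices` and `toy_learned_detector` are hypothetical names, the "model" is a single threshold on 1-D data, and a real detector would follow the held-out evaluation with a proper significance test. The only thing it is meant to show is that the caller passes in the full reference and test sets, and the train/held-out split is made internally.

```python
import random
from statistics import mean


def split_indices(n, train_size=0.75, seed=0):
    """Shuffle range(n) and split it into disjoint train / held-out index lists."""
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    n_train = int(n * train_size)
    return idx[:n_train], idx[n_train:]


def toy_learned_detector(x_ref, x_test, train_size=0.75, seed=0):
    """Return held-out accuracy of a threshold classifier telling ref from test."""
    # Label reference samples 0 and test samples 1, then split internally.
    data = [(x, 0) for x in x_ref] + [(x, 1) for x in x_test]
    train_idx, held_idx = split_indices(len(data), train_size, seed)
    train = [data[i] for i in train_idx]
    held = [data[i] for i in held_idx]

    # "Training": place a threshold halfway between the two class means.
    mu_ref = mean(x for x, y in train if y == 0)
    mu_test = mean(x for x, y in train if y == 1)
    threshold = (mu_ref + mu_test) / 2
    test_is_higher = mu_test >= mu_ref

    def predict(x):
        return int((x > threshold) == test_is_higher)

    # Evaluation uses only samples the classifier was not trained on.
    n_correct = sum(1 for x, y in held if predict(x) == y)
    return n_correct / len(held)


rng = random.Random(1)
x_ref = [rng.gauss(0.0, 1.0) for _ in range(400)]
x_same = [rng.gauss(0.0, 1.0) for _ in range(400)]
x_shifted = [rng.gauss(3.0, 1.0) for _ in range(400)]

acc_no_drift = toy_learned_detector(x_ref, x_same)   # expected to sit near chance
acc_drift = toy_learned_detector(x_ref, x_shifted)   # expected to sit well above chance
```

Held-out accuracy close to 0.5 means the classifier cannot tell the two sets apart, while accuracy well above 0.5 points to drift. Evaluating on the training portion instead would inflate the accuracy, which is why the hold-out matters.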

Hope that clears things up. We'll consider whether we can make this clearer in the docs.

jklaise added the Type: Question and Type: Docs labels Apr 28, 2023