
Unable to reproduce the report's distribution metrics using SUPPORT #20

Open
JoeLill100 opened this issue Jul 5, 2022 · 1 comment
Labels
bug Something isn't working

Comments

@JoeLill100

Hello,

I am trying to reproduce the distribution metrics established using SUPPORT, as stated on page 13 of the SynthVAE report.

I have downloaded available code and checked that my libraries are identical to those given in the requirements.txt file. I am using Python version 3.8.0.

I have run the following commands (one for each pre-processing method) on Windows in Command Prompt:

python scratch_vae_expts.py --pre_proc_method GMM

and:

python scratch_vae_expts.py --pre_proc_method Standard

I wasn't sure which pre-processing method was used in the report, but in both cases the distribution metrics I compute for the VAE model differ from those stated in the PDF. Could you suggest how to fix this? I have not modified the available code in any way. Perhaps the issue is due to seeding?

Thank you in advance.

@JoeLill100 JoeLill100 added the bug Something isn't working label Jul 5, 2022
@matthewcooper19

matthewcooper19 commented Aug 2, 2022

Hi Joe, as part of our work using SynthVAE in the synthetic data pipeline we found similar reproducibility issues. We found that any metrics that use sklearn components are not reproducible, and cannot be made so without changing the sdv code.

The reason for this is that setting the numpy random seed doesn't have the scope to set the sklearn random_state when it's imported from another file. As a result any metrics that use a sklearn component with a random_state argument will not be reproducible.

Metrics such as GMLogLikelihood and the detection metrics (e.g. logistic regression, support vector machine) all use sklearn, so they will be affected by this.

Although no fix is available at the moment, I wanted to add the above to give more info around the likely cause of this.
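To illustrate the general pattern (not SynthVAE's actual code): relying on a global seed is fragile, because any intervening call that consumes the global random state changes what downstream code gets, whereas passing an explicit seed or generator (analogous to sklearn's `random_state=` argument) pins the result down. A minimal sketch using numpy generators in place of a sklearn estimator:

```python
import numpy as np

# Fragile: depends on the global RandomState. An extra draw anywhere
# between the seed call and the metric changes the outcome.
np.random.seed(42)
_ = np.random.rand(5)        # e.g. some unrelated library call
global_draw = np.random.rand(3)

# Robust: an explicit generator is self-contained, so two generators
# built from the same seed always produce identical streams.
rng_a = np.random.default_rng(0)
rng_b = np.random.default_rng(0)
assert np.allclose(rng_a.random(3), rng_b.random(3))
```

The sklearn equivalent would be constructing each estimator with a fixed `random_state` (e.g. `GaussianMixture(n_components=10, random_state=0)`), which is why the fix has to happen inside the sdv/metric code rather than in the calling script.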
