Allow setting random state for reproducibility #59

psads-git · 2023-01-21T12:25:32Z

Dear Miles,

I have run gap-stat repeatedly on the same dataset. However, the optimal number of clusters that it returns is not always the same. I guess this happens because the reference distribution is randomly generated (the code uses numpy for that). So, for reproducibility, it seems reasonable for the optimalK function to accept a random_state argument.

If you agree, maybe I would be able to change the code accordingly, with your directions and help.

Thanks!
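To illustrate the request, here is a minimal sketch of how a seed could make the reference sampling repeatable. It assumes the reference data is drawn uniformly over the bounding box of the input; the function name `reference_sample` and its `random_state` argument are hypothetical and not taken from the gap-stat source.

```python
import numpy as np

def reference_sample(X, random_state=None):
    """Draw one uniform reference dataset over the bounding box of X.

    `random_state` is a hypothetical argument (an int seed or None);
    it is an assumption for illustration, not the library's signature.
    """
    rng = np.random.default_rng(random_state)
    lo, hi = X.min(axis=0), X.max(axis=0)
    # lo and hi broadcast across rows, so each column keeps its own range.
    return rng.uniform(lo, hi, size=X.shape)

X = np.array([[0.0, 0.0], [1.0, 2.0], [3.0, 1.0]])
a = reference_sample(X, random_state=42)
b = reference_sample(X, random_state=42)

# Two draws with the same seed are identical.
same_seed_matches = np.array_equal(a, b)
```

With `random_state=None` the generator is seeded from the operating system, so repeated calls would generally differ, which is the behaviour described above.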

psads-git · 2023-01-21T12:40:31Z

Or even better: let the user select the number of Monte Carlo (“bootstrap”) samples. The reason is given in the documentation of the R function clusGap:

The main result $Tab[,"gap"] of course is from bootstrapping aka Monte Carlo simulation and hence random, or equivalently, depending on the initial random seed (see set.seed()). On the other hand, in our experience, using B = 500 gives quite precise results such that the gap plot is basically unchanged after an another run.
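The effect of the number of reference samples can be sketched in a few lines. This is a simplified illustration, not the clusGap or gap-stat implementation: it evaluates the gap only at k = 1 (so no clustering step is needed), and the helper names `dispersion`, `gap_k1` and `spread` are made up for the example.

```python
import numpy as np

def dispersion(X):
    """Within-cluster dispersion when all points form one cluster (k = 1)."""
    return float(((X - X.mean(axis=0)) ** 2).sum())

def gap_k1(X, n_refs, seed):
    """Monte Carlo estimate of the gap at k = 1 from n_refs reference sets."""
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)
    ref_logs = [
        np.log(dispersion(rng.uniform(lo, hi, size=X.shape)))
        for _ in range(n_refs)
    ]
    return float(np.mean(ref_logs) - np.log(dispersion(X)))

data = np.random.default_rng(0).normal(size=(50, 2))

def spread(n_refs, repeats=20):
    """Standard deviation of the estimate across differently seeded runs."""
    return float(np.std([gap_k1(data, n_refs, seed) for seed in range(repeats)]))

spread_small = spread(5)    # few reference sets: noisy estimate
spread_large = spread(500)  # many reference sets: much more stable
```

Averaging over more reference sets reduces the run-to-run variation of the estimate, which is why a large B makes the result nearly repeatable even without a fixed seed.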

milesgranger · 2023-01-21T20:09:00Z

Hi!

I suppose one could use the clusterer param to pass their own callable that takes a random state? Either way, I'm open to this addition and have no strong opinions on how it ought to be done. So please feel free to open another PR and we'll see how it goes. 👍
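A minimal sketch of what such a callable might look like. It assumes the clusterer is called with the data and a cluster count and returns the cluster centers and labels; that contract, the toy k-means below, and the name `seeded_kmeans` are assumptions for illustration rather than confirmed library behaviour.

```python
from functools import partial

import numpy as np

def seeded_kmeans(X, k, random_state=None, n_iter=20):
    """Toy k-means whose initialisation is controlled by random_state."""
    rng = np.random.default_rng(random_state)
    # Fancy indexing copies the rows, so X itself is never modified.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    # Final assignment against the updated centers.
    dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return centers, dists.argmin(axis=1)

# Bind the seed so the callable only needs (X, k).
clusterer = partial(seeded_kmeans, random_state=7)

X = np.random.default_rng(1).normal(size=(60, 2))
centers_a, labels_a = clusterer(X, 3)
centers_b, labels_b = clusterer(X, 3)
```

This makes the clustering step repeatable, but it would presumably not seed the reference sampling done inside the library, so it addresses only part of the variation described in the issue.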

psads-git · 2023-01-22T12:57:35Z

Dear Miles,

One can run R from inside Python via the rpy2 package. Using the same dataset, the R package NbClust consistently provides the same optimal number of clusters and the same value for the gap statistic:

# R code
library(NbClust)

res <- NbClust(data_normalized, distance = "euclidean", 
              min.nc = 2, max.nc = 10, method = "kmeans", index="gap")

print(res$Best.nc)

So, I have to study the way they do that.

Have a nice Sunday!

Paulo

lebedov · 2023-07-05T19:45:30Z

Added this functionality in #61.

* Add support for seeding RNG used for random sampling (#59).
* Tweak docstring.
* Don't use 'int | None' for random_state type hint because it only works on Py 3.10+.
* Revert change to Cargo.lock.
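On the type-hint point: an annotation written as `int | None` is evaluated when the function is defined and fails on Python versions before 3.10, whereas `typing.Optional[int]` also works on earlier versions. A tiny sketch with a hypothetical function `draw` (not the code from #61):

```python
import random
from typing import Optional

def draw(random_state: Optional[int] = None) -> float:
    """Return one pseudo-random float; a fixed seed gives a fixed value."""
    # random.Random(None) seeds itself from the operating system.
    return random.Random(random_state).random()

first = draw(3)
second = draw(3)
```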

lebedov added a commit to lebedov/gap_statistic that referenced this issue Jul 5, 2023

Add support for seeding RNG used for random sampling (milesgranger#59).

0bf2e7f

lebedov mentioned this issue Jul 5, 2023

Add support for seeding RNG used for random sampling (#59). #61

Merged
