Random sampling optimisation resources
Random sampling is a key part of probabilistic programming, Bayesian deep learning, and text generation.
For example, in text generation with a 4-character model over "abcd", the network might predict the next-character probabilities as [0.1, 0.3, 0.4, 0.2], and we need to sample a character from this (multinomial) distribution.
Now imagine that the vocabulary instead contains hundreds of thousands of words: this is where scalable performance becomes essential.
This is a critical distribution to optimise: it is the one used to sample the next word or character in NLP text generation. It is also used to give more weight to certain labels during training, by sampling the training set with a skewed distribution (instead of scaling the gradient by a certain weight). And it is probably also the best way to implement stratified K-Fold.
It is the generalisation of the binomial distribution, where, for example, the probabilities of a biased coin toss could be [0.25, 0.75].
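As a quick illustration of the "abcd" example above, here is a minimal sketch using NumPy's `choice` (discussed in more detail below):

```python
import numpy as np

# Character vocabulary and the predicted next-character probabilities
# from the example above.
vocab = np.array(list("abcd"))
probs = [0.1, 0.3, 0.4, 0.2]

rng = np.random.default_rng(0)
# Draw 10 characters from this categorical/multinomial distribution;
# 'c' (probability 0.4) will tend to appear most often.
samples = rng.choice(vocab, size=10, p=probs)
```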
Let's start with the main sampling methods:
- Linear search
- Binary search on top of the cumulative distribution (see http://www.keithschwarz.com/darts-dice-coins/)
- Alias method
- Using Fenwick trees (see "A Scalable Asynchronous Distributed Algorithm for Topic Modeling": state-of-the-art, scalable to billions of words, and parallel)
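To make the first two methods concrete, here is a minimal pure-Python sketch (the function names are illustrative, not from any library). Linear search is O(n) per draw; binary search on a precomputed cumulative distribution is O(log n) per draw after O(n) preprocessing:

```python
import bisect
import itertools
import random

def sample_linear(probs, u):
    """Linear search: walk the probabilities until the running
    sum exceeds the uniform draw u. O(n) per sample."""
    acc = 0.0
    for i, p in enumerate(probs):
        acc += p
        if u < acc:
            return i
    return len(probs) - 1  # guard against floating-point round-off

def sample_bisect(cdf, u):
    """Binary search on a precomputed cumulative distribution.
    O(log n) per sample."""
    return bisect.bisect_right(cdf, u)

probs = [0.1, 0.3, 0.4, 0.2]
cdf = list(itertools.accumulate(probs))  # [0.1, 0.4, 0.8, 1.0]

# Both methods map the same uniform draw to the same index.
u = random.random()
assert sample_linear(probs, u) == sample_bisect(cdf, u)
```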
And a few curious ones:
For `choice`, NumPy uses binary search on top of the cumulative distribution:
```python
def choice(self, a, size=None, replace=True, p=None):
    """
    Parameters
    ----------
    a : 1-D array-like or int
        If an ndarray, a random sample is generated from its elements.
        If an int, the random sample is generated as if a were np.arange(a)
    size : int or tuple of ints, optional
        Output shape.
    replace : boolean, optional
        Whether the sample is with or without replacement
    p : 1-D array-like, optional
        The probabilities associated with each entry in a. If not given
        the sample assumes a uniform distribution over all entries in a.

    Returns
    -------
    samples : single item or ndarray
        The generated random samples

    Examples
    --------
    Generate a uniform random sample from np.arange(5) of size 3:

    >>> np.random.choice(5, 3)
    array([0, 3, 4])
    >>> # This is equivalent to np.random.randint(0, 5, 3)

    Generate a non-uniform random sample from np.arange(5) of size 3:

    >>> np.random.choice(5, 3, p=[0.1, 0, 0.3, 0.6, 0])
    array([3, 3, 0])
    """
    ...
    # Actual sampling
    if replace:
        if p is not None:
            cdf = p.cumsum()
            cdf /= cdf[-1]
            uniform_samples = self.random_sample(shape)
            idx = cdf.searchsorted(uniform_samples, side='right')
            idx = np.array(idx, copy=False)  # searchsorted returns a scalar
        else:
            idx = self.randint(0, pop_size, size=shape)
    else:
        if size > pop_size:
            raise ValueError("Cannot take a larger sample than "
                             "population when 'replace=False'")
        if p is not None:
            if np.count_nonzero(p > 0) < size:
                raise ValueError("Fewer non-zero entries in p than size")
            n_uniq = 0
            p = p.copy()
            found = np.zeros(shape, dtype=np.int)
            flat_found = found.ravel()
            while n_uniq < size:
                x = self.rand(size - n_uniq)
                if n_uniq > 0:
                    p[flat_found[0:n_uniq]] = 0
                cdf = np.cumsum(p)
                cdf /= cdf[-1]
                new = cdf.searchsorted(x, side='right')
                _, unique_indices = np.unique(new, return_index=True)
                unique_indices.sort()
                new = new.take(unique_indices)
                flat_found[n_uniq:n_uniq + new.size] = new
                n_uniq += new.size
            idx = found
        else:
            idx = self.permutation(pop_size)[:size]
            if shape is not None:
                idx.shape = shape
```
Note there is also NumPy's `multinomial`, which uses repeated binomial sampling, but that doesn't match our need.
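A quick sketch of why `multinomial` doesn't fit: it returns the *counts* of each category over n trials, not the sequence of sampled indices we need for generation.

```python
import numpy as np

rng = np.random.default_rng(0)
probs = [0.1, 0.3, 0.4, 0.2]

# multinomial returns one count per category over 20 trials
# (four non-negative integers summing to 20), not 20 sampled indices.
counts = rng.multinomial(20, probs)
```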
PyTorch uses either the alias method or CDF + binary search
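For reference, the alias method (Vose's variant, described in the darts-dice-coins article linked above) can be sketched as follows. This is an illustrative pure-Python implementation, not PyTorch's actual code; it pays O(n) preprocessing for O(1) sampling per draw:

```python
import random

def build_alias(probs):
    """Vose's alias method preprocessing: O(n).
    Each slot i holds a threshold prob[i] and an alias index,
    so every slot covers at most two categories."""
    n = len(probs)
    scaled = [p * n for p in probs]
    small = [i for i, p in enumerate(scaled) if p < 1.0]
    large = [i for i, p in enumerate(scaled) if p >= 1.0]
    prob, alias = [0.0] * n, [0] * n
    while small and large:
        s, l = small.pop(), large.pop()
        prob[s] = scaled[s]
        alias[s] = l
        # Give the leftover mass of slot s to category l.
        scaled[l] = (scaled[l] + scaled[s]) - 1.0
        (small if scaled[l] < 1.0 else large).append(l)
    for i in small + large:  # leftovers are exactly 1.0 up to round-off
        prob[i] = 1.0
    return prob, alias

def sample_alias(prob, alias):
    """Draw one index in O(1): pick a slot uniformly,
    then flip a biased coin between the slot and its alias."""
    i = random.randrange(len(prob))
    return i if random.random() < prob[i] else alias[i]

probs = [0.1, 0.3, 0.4, 0.2]
prob, alias = build_alias(probs)
draws = [sample_alias(prob, alias) for _ in range(10_000)]
```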
TODO