Random sampling optimisation resources
Random sampling is a key part of probabilistic programming, Bayesian deep learning, and text generation.
For example, in text generation with a 4-character model over "abcd", the network might predict the next-character probabilities as [0.1, 0.3, 0.4, 0.2], and we need to sample a character from this (multinomial) distribution.
Now imagine that the vocabulary instead contains hundreds of thousands of words: this is where scalable performance becomes essential.
This is a critical distribution to optimise: it is the one used to sample the next word or character in NLP text generation. It is also used to give more weight to certain labels during training, by sampling the training set with a skewed distribution (instead of scaling the gradient by a certain weight). And it is probably also the best way to implement stratified K-Fold.
It is the generalisation of the binomial distribution, where, for example, the probabilities of a biased coin toss could be [0.25, 0.75].
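As a quick illustration of the "abcd" example above, here is a minimal sketch using NumPy's `choice` (discussed in more detail below):

```python
import numpy as np

# Character vocabulary and the predicted next-character probabilities
# from the example above.
vocab = np.array(list("abcd"))
probs = [0.1, 0.3, 0.4, 0.2]

rng = np.random.default_rng(0)
# Draw 10 characters from this categorical/multinomial distribution;
# 'c' (probability 0.4) will tend to appear most often.
samples = rng.choice(vocab, size=10, p=probs)
```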
Let's start with the main sampling methods:
- Linear search
- Binary search on top of the cumulative distribution (see http://www.keithschwarz.com/darts-dice-coins/)
- Alias method
- Using Fenwick trees (see "A Scalable Asynchronous Distributed Algorithm for Topic Modeling": state-of-the-art, scalable to billions of words, and parallel)
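To make the first two methods concrete, here is a minimal pure-Python sketch (the function names are illustrative, not from any library). Linear search is O(n) per draw; binary search on a precomputed cumulative distribution is O(log n) per draw after O(n) preprocessing:

```python
import bisect
import itertools
import random

def sample_linear(probs, u):
    """Linear search: walk the probabilities until the running
    sum exceeds the uniform draw u. O(n) per sample."""
    acc = 0.0
    for i, p in enumerate(probs):
        acc += p
        if u < acc:
            return i
    return len(probs) - 1  # guard against floating-point round-off

def sample_bisect(cdf, u):
    """Binary search on a precomputed cumulative distribution.
    O(log n) per sample."""
    return bisect.bisect_right(cdf, u)

probs = [0.1, 0.3, 0.4, 0.2]
cdf = list(itertools.accumulate(probs))  # [0.1, 0.4, 0.8, 1.0]

# Both methods map the same uniform draw to the same index.
u = random.random()
assert sample_linear(probs, u) == sample_bisect(cdf, u)
```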
And a few curious ones:
For `choice`, NumPy uses binary search on top of the cumulative distribution:
```python
def choice(self, a, size=None, replace=True, p=None):
    """
    Parameters
    ----------
    a : 1-D array-like or int
        If an ndarray, a random sample is generated from its elements.
        If an int, the random sample is generated as if a were np.arange(a)
    size : int or tuple of ints, optional
        Output shape.
    replace : boolean, optional
        Whether the sample is with or without replacement
    p : 1-D array-like, optional
        The probabilities associated with each entry in a. If not given
        the sample assumes a uniform distribution over all entries in a.

    Returns
    -------
    samples : single item or ndarray
        The generated random samples

    Examples
    --------
    Generate a uniform random sample from np.arange(5) of size 3:

    >>> np.random.choice(5, 3)
    array([0, 3, 4])
    >>> # This is equivalent to np.random.randint(0, 5, 3)

    Generate a non-uniform random sample from np.arange(5) of size 3:

    >>> np.random.choice(5, 3, p=[0.1, 0, 0.3, 0.6, 0])
    array([3, 3, 0])
    """
    ...
    # Actual sampling
    if replace:
        if p is not None:
            cdf = p.cumsum()
            cdf /= cdf[-1]
            uniform_samples = self.random_sample(shape)
            idx = cdf.searchsorted(uniform_samples, side='right')
            idx = np.array(idx, copy=False)  # searchsorted returns a scalar
        else:
            idx = self.randint(0, pop_size, size=shape)
    else:
        if size > pop_size:
            raise ValueError("Cannot take a larger sample than "
                             "population when 'replace=False'")
        if p is not None:
            if np.count_nonzero(p > 0) < size:
                raise ValueError("Fewer non-zero entries in p than size")
            n_uniq = 0
            p = p.copy()
            found = np.zeros(shape, dtype=np.int)
            flat_found = found.ravel()
            while n_uniq < size:
                x = self.rand(size - n_uniq)
                if n_uniq > 0:
                    p[flat_found[0:n_uniq]] = 0
                cdf = np.cumsum(p)
                cdf /= cdf[-1]
                new = cdf.searchsorted(x, side='right')
                _, unique_indices = np.unique(new, return_index=True)
                unique_indices.sort()
                new = new.take(unique_indices)
                flat_found[n_uniq:n_uniq + new.size] = new
                n_uniq += new.size
            idx = found
        else:
            idx = self.permutation(pop_size)[:size]
            if shape is not None:
                idx.shape = shape
```
Note there is also NumPy's `multinomial`, which uses repeated binomial sampling, but that doesn't match our need.
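A quick sketch of why `multinomial` doesn't fit: it returns the *counts* of each category over n trials, not the sequence of sampled indices we need for generation.

```python
import numpy as np

rng = np.random.default_rng(0)
probs = [0.1, 0.3, 0.4, 0.2]

# multinomial returns one count per category over 20 trials
# (four non-negative integers summing to 20), not 20 sampled indices.
counts = rng.multinomial(20, probs)
```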
PyTorch uses either the alias method or CDF + binary search
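For reference, the alias method (Vose's variant, described in the darts-dice-coins article linked above) can be sketched as follows. This is an illustrative pure-Python implementation, not PyTorch's actual code; it pays O(n) preprocessing for O(1) sampling per draw:

```python
import random

def build_alias(probs):
    """Vose's alias method preprocessing: O(n).
    Each slot i holds a threshold prob[i] and an alias index,
    so every slot covers at most two categories."""
    n = len(probs)
    scaled = [p * n for p in probs]
    small = [i for i, p in enumerate(scaled) if p < 1.0]
    large = [i for i, p in enumerate(scaled) if p >= 1.0]
    prob, alias = [0.0] * n, [0] * n
    while small and large:
        s, l = small.pop(), large.pop()
        prob[s] = scaled[s]
        alias[s] = l
        # Give the leftover mass of slot s to category l.
        scaled[l] = (scaled[l] + scaled[s]) - 1.0
        (small if scaled[l] < 1.0 else large).append(l)
    for i in small + large:  # leftovers are exactly 1.0 up to round-off
        prob[i] = 1.0
    return prob, alias

def sample_alias(prob, alias):
    """Draw one index in O(1): pick a slot uniformly,
    then flip a biased coin between the slot and its alias."""
    i = random.randrange(len(prob))
    return i if random.random() < prob[i] else alias[i]

probs = [0.1, 0.3, 0.4, 0.2]
prob, alias = build_alias(probs)
draws = [sample_alias(prob, alias) for _ in range(10_000)]
```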
TODO